ABSTRACT

Today, the publication of microdata poses a privacy threat. Vast 
research has striven to define the privacy condition that microdata 
should satisfy before it is released, and devise algorithms to anony- 
mize the data so as to achieve this condition. Yet, no method pro- 
posed to date explicitly bounds the percentage of information an 
adversary gains after seeing the published data for each sensitive 
value therein. This paper introduces β-likeness, an appropriately
robust privacy model for microdata anonymization, along with two 
anonymization schemes designed therefor, the one based on gen- 
eralization, and the other based on perturbation. Our model pos- 
tulates that an adversary's confidence on the likelihood of a certain 
sensitive-attribute (SA) value should not increase, in relative difference
terms, by more than a predefined threshold. Our techniques
aim to satisfy a given β threshold with little information loss. We
experimentally demonstrate that (i) our model provides an effective 
privacy guarantee in a way that predecessor models cannot, (ii) our 
generalization scheme is more effective and efficient in its task than 
methods adapting algorithms for the k-anonymity model, and (iii) 
our perturbation method outperforms a baseline approach. More- 
over, we discuss in detail the resistance of our model and methods 
to attacks proposed in previous research. 

1. INTRODUCTION 

Organizations, such as government agencies or hospitals, reg- 
ularly release microdata (e.g., census data or medical records) to 
serve benign purposes. However, such data can inadvertently re- 
veal sensitive personal information to malicious adversaries. Ex- 
perience has shown that merely concealing explicit identifying at- 
tributes, such as name or phone number, does not suffice to protect 
personal privacy. An attacker may still uncover hidden identities 
and/or sensitive information, by joining the released microdata at- 
tributes with other publicly available data. The set of attributes 
instrumental to that purpose, such as gender, zipcode, and age, are 
called quasi-identifiers (QIs). The anonymization problem calls for
bringing the data to a form that forestalls such linking attacks while
preserving as much of the original information as possible. 

The question of the form the data should be brought to is a sub- 
ject of inquiry in itself. Past research has tried to formulate a pri- 
vacy guarantee an anonymized data set should satisfy, using syn- 
tactic and perturbation-based methods. 

Syntactic anonymization methods typically postulate that micro- 
data be partitioned into a set of equivalence classes (ECs), such that 
all tuples within an EC be indistinguishable from (or mutually inter- 
changeable with [33]) each other as far as their QIs are concerned. 
The models differ in the condition that an eligible EC should satisfy.
By k-anonymity, each EC should consist of at least k tuples
[29]. In effect, k-anonymity protects against identity disclosure, as
it hides each released tuple in a crowd of at least k - 1 others, but
does not attend to the values of a non-QI sensitive attribute (SA);
hence, the privacy regarding such values may be compromised. To
address this limitation, ℓ-diversity requires that each EC contain
at least ℓ different "well represented" SA values (in a mathematical
sense) [22]. Even so, ℓ-diversity fails to protect against attacks arising
from an adversary's unavoidable knowledge of each SA value's
frequency in a released table. As a rectification to this problem, 
t-closeness proposes a condition that bounds the cumulative differ- 
ence between the frequency distribution of SA values in an EC and 
their overall distribution [20]. Yet, as we will discuss, such a bound 
fails to provide a meaningful privacy guarantee that lays grounds 
for effective and human-understandable policy [25]. 

Perturbation-based methods add noise to the data so as to achieve 
a privacy property. The models in [10, 30, 5] impose a bound on an 
adversary's posterior confidence about a data property in relation to 
the prior one; however, they measure confidence gain in absolute, 
not in relative terms. Other noise-adding methods enforce differ- 
ential privacy [9], which guarantees that the effect of any particular 
individual's data on a query result is dominated by the noise; in 
other words, the result is broadly the same, regardless of whether a 
certain individual has contributed her true information. Yet, as [6] 
shows, an individual's SA value can be inferred from differentially 
private data with non-trivial accuracy, while the added noise can 
dominate small values in the results of aggregate queries [32]. 

In this paper, we propose β-likeness: a robust and intuitive model
for microdata anonymization, postulating that an adversary's con- 
fidence in a tuple's SA value should not increase in relative terms 
by more than a threshold after seeing the published data. We ac- 
company this model with two anonymization schemes tailored for 
its particular requirements: one based on generalization, and one on 
perturbation; the latter can better handle remote outliers. We exper- 
imentally demonstrate that our schemes: (i) provide effective pri- 
vacy guarantees in a way that state-of-the-art t-closeness schemes 
cannot; and (ii) are more efficient than competing approaches.

2. RELATED WORK AND ARGUMENT

The first model suggested for anonymizing microdata while preserving
their integrity was k-anonymity [29]; it suggests grouping
tuples in ECs of at least k tuples each, with indistinguishable QI
values. As the problem of optimal (i.e., minimum-information-loss)
k-anonymization is NP-hard [23] in non-trivial cases, past
research has proposed several heuristics. Such schemes transform 
the data by generalization and/or suppression. Generalization re- 
places, or recodes, all values of a QI attribute in an EC by a range 
containing them. For example, QI gender with values male and 
female can be generalized to person, and QI age with values 20, 25 
and 32 can be generalized to [20, 32]. Suppression is an extreme 
case of generalization that deletes some QI values or even tuples. 

Still, the k-anonymity model suffers from a critical limitation.
While its objective is to conceal sensitive information, it pays no attention
to non-QI sensitive attributes (SAs). A k-anonymized table
may contain ECs with so skewed a distribution of SA values that
an adversary can still infer the SA value of a record with high confidence.
To address this limitation, [22] proposed ℓ-diversity, which
postulates that each EC contain at least ℓ "well represented" SA
values, where "well represented" can be defined in diverse ways.

Still, ℓ-diversity fails to guarantee privacy when the distribution
of SA values differs substantially among ECs and from their overall
distribution; thus, it is vulnerable to a skewness attack [20]. For
instance, assume a 10-diverse form T' of a medical record table T,
in which 0.1% of persons are infected with HIV, and an EC G ∈ T'
containing 10 distinct SA values, with one occurrence of HIV. The
probability of HIV is 10% for a tuple in G, but only 0.1% for a tuple
in T. This 100-fold increase of probability is a significant, hence
undesirable, information leak. Furthermore, a similarity attack [20]
is likely when the SA values in an EC are semantically similar. For
example, a 3-diverse table can be generated from Table 1 by putting
the first 3 tuples in EC G_1, and the rest in EC G_2. Regardless of their
diversity, all tuples in G_1 indicate a nervous problem.



ID   Name    Weight   Age   Disease
01   Mike    70       40    headache
02   John    60       60    epilepsy
03   Bob     50       50    brain tumors
04   Alice   70       50    heart murmur
05   Beth    80       50    anemia
06   Carol   60       70    angina

Table 1: Patient records



To forestall these attacks, Li et al. proposed t-closeness, which
requires that a cumulative difference of the SA values' distribution
within any EC from the one in the overall table does not exceed a
given threshold t [20]. The t threshold is meant to constrain the information
an adversary gains after seeing a single EC, with respect
to that provided by the full released table. Just as ℓ-diversity is
open to many ways of measuring the number of "well-represented"
values in an EC [22], the t-closeness model is open to diverse ways
of measuring the cumulative difference between the overall SA distribution,
P, and that in an EC, Q. One option is the Earth Mover's
Distance (EMD) [28]. Another proposal [20] first transforms P
(Q) to P̂ (Q̂) by kernel smoothing, and then calculates the Jensen-Shannon
divergence between P̂ and Q̂ as the approximate distance
between P and Q. Last, the Kullback-Leibler divergence is used
in [27]. Yet these functions all interpret the t threshold as a bound
on the cumulative difference between two frequency distributions.
Indeed, this interpretation emanates from the t-closeness model
itself [20]. Still, a privacy model should provide grounds for effective
and human-understandable policy [25]. Models that bound a
cumulative function of frequency differences between distributions
fail to provide a comprehensible relationship between the t threshold
and the privacy it affords. In particular, such models do not pay
due attention to less frequent SA values, which are more vulnerable
to privacy exposure; and do not distinguish between positive
and negative variation in an SA value's frequency.

We first elaborate on EMD. Assume a data set DB with SA values
HIV and Flu. If the overall SA distribution between them is
P = (0.4, 0.6), and their distribution in an EC is Q = (0.5, 0.5),
then EMD(P, Q) = 0.1. Still, if their overall distribution is P' =
(0.01, 0.99) and their distribution in an EC is Q' = (0.11, 0.89),
then EMD(P', Q') = 0.1 again. Both cases satisfy 0.1-closeness.
However, the information gain in the latter case is much larger than 
that in the former: the probability of HIV rises by 25% from 0.4 to 
0.5, but by 1000% from 0.01 to 0.11. In effect, the two cases do 
not afford the same privacy. This example appears in [20], where 
it is noted that EMD does not provide a clear privacy guarantee. In 
fact, not only EMD, but any function that aggregates absolute dif- 
ferences faces a similar problem, since such functions do not pro- 
vide maximum relative difference guarantees [14, 13] about individ- 
ual SA values. In our example, a small relative difference of Flu- 
frequency evens up a large relative difference of HIV-frequency. 
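The arithmetic of this example can be reproduced with a short Python sketch. It assumes that EMD is computed with equal ground distance between any two distinct values, in which case it reduces to half the sum of absolute frequency differences; the function names are ours and purely illustrative.

```python
def emd_equal_distance(p, q):
    """EMD between two distributions, assuming every pair of distinct
    values is at ground distance 1 (our assumption for this sketch)."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def max_relative_gain(p, q):
    """Largest relative increase (q_i - p_i) / p_i among values whose
    frequency rises in the EC; 0.0 if no frequency rises."""
    return max(((qi - pi) / pi for pi, qi in zip(p, q) if qi > pi), default=0.0)


# (HIV, Flu) frequencies from the running example
P, Q = (0.4, 0.6), (0.5, 0.5)
P2, Q2 = (0.01, 0.99), (0.11, 0.89)

for overall, ec in ((P, Q), (P2, Q2)):
    print(round(emd_equal_distance(overall, ec), 4),
          round(max_relative_gain(overall, ec), 4))
```

Both pairs yield an EMD of 0.1, whereas the maximum relative gain is 0.25 for the first pair and 10 for the second, i.e., the 25% and 1000% increases discussed above.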

K-L divergence [27] and J-S divergence [20, 21] also fail to pay
equal attention to all SA values and their relative differences. In
our running example, assume a dataset where the overall distribution
of HIV and Flu is P' = (0.01, 0.99), and their distribution in an
EC is Q' = (0.03, 0.97). Then the K-L (J-S) divergence between P
and Q is 0.0290 (0.0073), while that between P' and Q' is 0.0133
(0.0038). Both these alternatives estimate the privacy afforded by
Q' with respect to P' as higher than that afforded by Q with respect
to P. However, the confidence for HIV increases only by 25% in
the latter case, while it rises by 200% in the former.
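The divergence figures quoted above can be checked with the following Python sketch. The logarithm base and the direction of the K-L divergence are not stated in the text; base 2 and KL(overall || EC) are our assumptions, chosen because they reproduce the quoted values. The function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits (base-2 logarithm)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean K-L divergence to the midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# (HIV, Flu) frequencies: overall distribution, then distribution in the EC
CASE_25_PERCENT = ((0.4, 0.6), (0.5, 0.5))       # HIV rises by 25%
CASE_200_PERCENT = ((0.01, 0.99), (0.03, 0.97))  # HIV rises by 200%

for overall, ec in (CASE_25_PERCENT, CASE_200_PERCENT):
    print(round(kl_divergence(overall, ec), 4),
          round(js_divergence(overall, ec), 4))
```

Under these assumptions, both divergences come out smaller for the pair in which the HIV frequency triples than for the pair in which it rises by a quarter, which is the mismatch the paragraph describes.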

Besides, the anonymization schemes in [20] are mere extensions
of k-anonymization techniques [17, 18]. They do not cater to the
special needs of t-closeness, hence yield low information quality.
Recently, [4] proposed an anonymization algorithm specialized for
t-closeness, yet did not discuss the limitations of the model itself.
Last, the anonymization scheme in [27] uses perturbation and adds 
noise to the data, damaging their truthfulness. 

The privacy model of [10] imposes a bound ρ2 on the posterior
probability (i.e., after release) of certain properties in the data,
given a bound ρ1 on the prior probability (i.e., before release). This
model is modified in [30], where the posterior confidence should
not exceed the prior one by more than Δ. These models measure
the absolute confidence gain (i.e., information leak), hence do not
sufficiently protect the privacy of infrequent values. For example,
they treat a probability increase from 60% to 80% as tantamount to
an increase from 1% to 21% in absolute terms, while the latter is
an increase by 2000% and the former by 33% in relative terms.

Alternative approaches enforce differential privacy [9]. By this 
model, the data owner adds noise to a query result so as to guaran- 
tee that this noisy result would change very little with the variation 
of a particular individual's data. However, [16] illustrates that dif- 
ferential privacy does not adequately limit inference about an indi- 
vidual's participation in the data generating process. Furthermore, 
and more importantly for the focus of our work, [6] has recently 
shown that, even though the effect of any single individual is dom- 
inated by the added noise, the noise itself is in turn dominated by 
the signal emerging from the whole population. Consequently, one 
can effectively build a Naive Bayes classifier inferring individuals' 
SA values with non-trivial accuracy [6]. 

A recently proposed distribution-oriented privacy model is δ-disclosure-privacy
[3]; it requires that for any SA value v_i with frequency
p_i in the original table, its frequency in any EC, q_i, should
be such that |log(q_i / p_i)| < δ. Yet this model fails in two respects:
(1) since log(q_i) is defined only for q_i > 0, δ-disclosure-privacy
strictly requires that each SA value in the original table occurs in
every EC; (2) given a sufficiently large value of p_i and a modest
value of δ, δ-disclosure-privacy does not effectively upper-bound
q_i, hence allows for absolute certainty of one's SA value, which is
exactly the kind of leak it is meant to prevent. These properties
render δ-disclosure-privacy unnecessarily rigid, in one way, and
yet exceedingly lax, in another way. Besides, [3] does not propose
an anonymization algorithm tailored for δ-disclosure-privacy;
it only points out that the Mondrian k-anonymization algorithm
[18], adapted for δ-disclosure-privacy (as well as for ℓ-diversity
and t-closeness), yields high information loss. This negative result
is not surprising; after all, Mondrian simply partitions the data into
disjoint ECs, hence is ill-suited for models looking into the sensitive
values in an EC, as observed in [12]. In its conclusions, [3]
observes that better anonymization algorithms are needed for those
models, but does not provide such algorithms; it focuses on a negative
result without attempting to ameliorate it. In this paper, we provide
a meaningful distribution-oriented privacy model that avoids
the drawbacks of δ-disclosure-privacy and t-closeness, as well as
an anonymization algorithm specifically designed therefor. Thus,
our work goes beyond [3] in all these respects.

3. THE PRIVACY MODEL 

This section introduces our privacy model. Our model assumes
that the SA distribution in DB is public knowledge, and constrains
the SA-related information gained by the table's publication. Table
2 gathers together the notations we use.



Notation                     Meaning
DB                           Original microdata table
SA                           Sensitive attribute in DB
V = {v_1, v_2, ..., v_m}     The domain of SA
N_i                          Number of tuples with v_i in DB
p_i = N_i / |DB|             Frequency of v_i in DB
P = (p_1, p_2, ..., p_m)     Overall SA distribution in DB
G                            Equivalence class
Q = (q_1, q_2, ..., q_m)     SA distribution in G

Table 2: Notations



Definition 1 (information gain). Assume that DB is a
table with a sensitive attribute SA. Let V = {v_1, v_2, ..., v_m}
be the SA domain, and P = (p_1, p_2, ..., p_m) be the overall SA
distribution in DB. Suppose that Q = (q_1, q_2, ..., q_m) is the SA
distribution in an equivalence class G, formed by tuples from DB.
The information gain on any SA value v_i ∈ V is D(p_i, q_i), where D
is a distance function between p_i and q_i.

We say that the information gain on v_i is positive when p_i < q_i,
and negative when p_i > q_i. Negative information gain lowers the
correlation between a personal record and v_i in EC G below that in
the whole table. In most cases, such gain enhances privacy. How- 
ever, there may exist SA values such as heterosexual, for which 
a reduced likelihood may inadvertently violate privacy. Neverthe- 
less, we assume that the SA domain always includes the negation 
of such values. Thus, negative information gain on heterosexual 
always appears as positive gain for homosexual. Therefore, we can 
directly control the positive gain on the value (such as homosexual) 
that poses the privacy threat. For a more general case such as mar- 
ital status, the negative gain on SA value married can imply that 
an individual is more likely to be divorced or widowed. However, 
we assume that the SA domain contains all the values of interest. 
Hence, the relative negative gain of married can be transformed to 
the positive gains of divorced, and widowed. Based on the above 
reasonable assumption, we are concerned with positive information 
gain; negative gain can be treated symmetrically if circumstances 
demand it (see Section 7). We define basic β-likeness as follows.



Definition 2 (basic β-likeness). Given table DB with
sensitive attribute SA, let V = {v_1, ..., v_m} be the SA domain,
and P = (p_1, ..., p_m) the overall SA distribution in DB. An EC
G with SA distribution Q = (q_1, ..., q_m) is said to satisfy basic
β-likeness if and only if max{D(p_i, q_i) | p_i ∈ P, p_i < q_i} ≤ β,
where β > 0 is a threshold.

For a table DB' anonymized from table DB to obey β-likeness,
all equivalence classes G ⊂ DB' have to conform to β-likeness.
Contrary to previous models [20, 3, 21, 27], basic β-likeness
clearly quantifies the relationship between the β threshold and positive
information gain. Thanks to the maximum-distance threshold
it imposes, it inherently safeguards against skewness attacks and
semantic attacks [20]. Last, as it clearly distinguishes between positive
and negative information gain, and accepts SA values absent
from an EC, it allows for more flexibility in anonymization, hence
higher information quality, than the closest related model, δ-disclosure-privacy
[3]. Apart from specifying a maximum, instead of
a cumulative, distance threshold, we should also define the distance
function D in an appropriate manner. As we have argued, a measure
of absolute difference does not serve our purposes, since it fails to
protect less frequent SA values. We opt for relative difference instead,
and define the distance function as D(p_i, q_i) = (q_i - p_i) / p_i. This
function obeys the monotonicity property.

Lemma 1 (Monotonicity Property). Assume that SA
value v_i ∈ V has frequency p_i in the overall table DB, q_i^1 (q_i^2)
in EC G_1 (G_2), generated from tuples in DB, and q_i^3 in G_1 ∪ G_2.
Then D(p_i, q_i^3) ≤ max{D(p_i, q_i^1), D(p_i, q_i^2)}.

PROOF. Assume there are n_1 (n_2) tuples with v_i in G_1 (G_2). Then

$$
q_i^1 = \frac{n_1}{|G_1|}, \quad
q_i^2 = \frac{n_2}{|G_2|}, \quad
q_i^3 = \frac{n_1 + n_2}{|G_1| + |G_2|}
      = \frac{q_i^1 |G_1| + q_i^2 |G_2|}{|G_1| + |G_2|}
      \le \max\{q_i^1, q_i^2\}.
$$

Since D(p_i, q) increases with q for a fixed p_i, it follows that
D(p_i, q_i^3) ≤ max{D(p_i, q_i^1), D(p_i, q_i^2)}. □
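As a small numeric illustration of Lemma 1, the Python sketch below merges two ECs and compares the relative gain before and after. The counts are made up for the illustration, and the helper names are ours.

```python
def relative_gain(p_i, q_i):
    """Distance function D(p_i, q_i) = (q_i - p_i) / p_i."""
    return (q_i - p_i) / p_i


def merged_frequency(n1, size1, n2, size2):
    """Frequency of an SA value in the union of two ECs, given the number
    of tuples carrying it (n1, n2) and the EC sizes (size1, size2)."""
    return (n1 + n2) / (size1 + size2)


P_I = 0.1                        # overall frequency of the value (hypothetical)
Q1 = 3 / 10                      # 3 of 10 tuples in the first EC
Q2 = 1 / 20                      # 1 of 20 tuples in the second EC
Q3 = merged_frequency(3, 10, 1, 20)

print(relative_gain(P_I, Q1), relative_gain(P_I, Q2), relative_gain(P_I, Q3))
```

The merged EC's gain lies between the gains of its two parts, so it cannot exceed the larger of them.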

The monotonicity property ensures that a union of two ECs yields
no larger distance between p_i and q_i than the larger of the distances
in its two parts. Hence,
ECs violating β-likeness can be transformed to follow β-likeness
by merge operations. The relative distance function instantiates basic
β-likeness by the constraint D(p_i, q_i) = (q_i - p_i) / p_i ≤ β, where
p_i and q_i are the frequencies of any SA value v_i ∈ V in the
whole table and an EC, respectively. This constraint amounts to
an upper bound for the frequency of v_i in any EC, q_i, namely
q_i ≤ (1 + β) · p_i. Our relative distance function pays due attention
to less frequent SA values. However, this function provides
a meaningful frequency bound only if (1 + β) · p_i < 1; it then caters
for SA values whose frequency in DB is p_i < 1/(1 + β). In our effort
to pay due attention to such less frequent values, we have discriminated
against SA values of frequency larger than 1/(1 + β). Such values
can assume frequency 1 in an EC. Thus, an adversary identifying
that a person's record is within such an EC can infer the SA value of
that person with 100% confidence. The disclosure of such frequent
SA values may pose a privacy threat. To address this limitation, we
provide a stronger, enhanced definition of β-likeness.
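The limitation of the basic model can be seen in a few lines of Python. This is a sketch under our own naming; the three-value distribution is hypothetical and chosen so that one value is more frequent than 1/(1 + β).

```python
def satisfies_basic(p, q, beta, eps=1e-12):
    """Basic beta-likeness: every SA value whose frequency rises in the EC
    has relative gain (q_i - p_i) / p_i of at most beta.
    Assumes p_i > 0 for every value; eps absorbs rounding error."""
    return all((qi - pi) / pi <= beta + eps
               for pi, qi in zip(p, q) if qi > pi)


OVERALL = (0.1, 0.3, 0.6)          # hypothetical overall SA distribution
SKEWED_RARE = (0.3, 0.3, 0.4)      # rare value tripled: gain 2 > beta = 1
ALL_FREQUENT = (0.0, 0.0, 1.0)     # only the frequent value: gain 2/3 <= 1

print(satisfies_basic(OVERALL, SKEWED_RARE, beta=1.0))
print(satisfies_basic(OVERALL, ALL_FREQUENT, beta=1.0))
```

With β = 1, the EC that triples the rare value is rejected, yet the EC consisting solely of the frequent value is accepted, although it reveals that value with full confidence.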

Definition 3 (enhanced β-likeness). For table DB
with sensitive attribute SA, let V = {v_1, ..., v_m} be the SA domain,
and P = (p_1, ..., p_m) the overall SA distribution in DB.
An EC G with SA distribution Q = (q_1, ..., q_m) is said to satisfy
enhanced β-likeness if and only if ∀q_i, D(p_i, q_i) = (q_i - p_i) / p_i ≤
min{β, -ln p_i}, where β > 0 is a threshold and ln p_i is the natural
logarithm of p_i.

The inequality constraint in the above definition implies that q_i ≤
(1 + min{β, -ln p_i}) · p_i. We can then define the upper bound
that enhanced β-likeness imposes on the frequency of v_i in an EC
by function f(p_i) = (1 + min{β, -ln p_i}) · p_i, which can be
decomposed as follows.

$$
f(p_i) =
\begin{cases}
p_i (1 + \beta), & 0 < p_i \le e^{-\beta} \\
p_i (1 - \ln p_i), & e^{-\beta} < p_i \le 1
\end{cases}
\qquad (1)
$$



The first segment of f(p_i) is a linear, monotonically increasing
function of p_i. The second segment is a concave, also monotonically
increasing function of p_i, with derivative -ln p_i. The two
segments meet at p_i = e^{-β}. In effect, f(p_i) is a continuous,
monotonically increasing function of p_i in (0, 1] with f(0) = 0
and f(1) = 1. Intuitively, the second segment bends the function's
slope so as not to exceed the maximum value of 1. The monotonicity
of f(p_i) implies that an EC G following the enhanced β-likeness
constraint obeys the following properties:

1. The maximum frequency of an SA value v_i in G is less than
1, i.e., f(p_i) < 1 for any p_i < 1.

2. For two SA values v_i and v_ℓ, such that p_i < p_ℓ, the maximum
allowed frequency of v_i in G is less than that of v_ℓ, i.e.,
f(p_i) < f(p_ℓ).

3. For an SA value v_i that is 'infrequent' in table DB, with
p_i ≤ e^{-β}, its frequency in G is at most β times larger than
p_i, i.e., q_i ≤ f(p_i) = (1 + β) · p_i.

4. For an SA value v_i that is 'frequent' in DB, with p_i > e^{-β},
its frequency in G is at most -ln p_i times larger than p_i, i.e.,
q_i ≤ f(p_i) = (1 - ln p_i) · p_i < (1 + β) · p_i.

These properties protect privacy for all SA values: infrequent
values receive due attention, while more frequent ones are disallowed
from assuming frequency values of 1. The β parameter
defines the privacy constraint for less frequent values, as well as
the frequency threshold e^{-β} above which the privacy constraint assumes
a default form independent of β. This framework applies for
any monotonic upper-bound function. Our choice of -ln p_i is only
a convenient choice that confers the desirable properties. As enhanced
β-likeness provides more robust privacy than basic β-likeness,
in the following we focus on it. Unless otherwise specified,
henceforth by β-likeness we mean its enhanced form.
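The upper-bound function and the resulting membership test can be sketched in Python as follows. The function names and the three-value distribution are ours, introduced only for illustration.

```python
import math


def f_bound(p_i, beta):
    """Upper bound f(p_i) = (1 + min(beta, -ln p_i)) * p_i on the frequency
    that an SA value of overall frequency p_i may reach in an EC."""
    return (1.0 + min(beta, -math.log(p_i))) * p_i


def satisfies_enhanced(p, q, beta, eps=1e-12):
    """Enhanced beta-likeness: q_i <= f(p_i) for every SA value.
    Assumes p_i > 0 for every value; eps absorbs rounding error."""
    return all(qi <= f_bound(pi, beta) + eps for pi, qi in zip(p, q))


OVERALL = (0.1, 0.3, 0.6)       # hypothetical overall SA distribution
ALL_FREQUENT = (0.0, 0.0, 1.0)  # EC holding only the most frequent value
MILD = (0.2, 0.3, 0.5)          # EC that doubles the rarest value

print(f_bound(0.1, 1.0), f_bound(0.6, 1.0), f_bound(1.0, 1.0))
print(satisfies_enhanced(OVERALL, ALL_FREQUENT, beta=1.0))
print(satisfies_enhanced(OVERALL, MILD, beta=1.0))
```

For β = 1 the two segments meet at e^{-1} ≈ 0.368, so the value of frequency 0.6 falls under the logarithmic segment and its bound stays below 1; the EC holding only that value is therefore rejected.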

While (enhanced) β-likeness defines only an upper bound on q_i,
the cognate δ-disclosure-privacy model [3] amounts to two bounds
on q_i, demanding that |log(q_i / p_i)| < δ, or, equivalently, e^{-δ} · p_i <
q_i < e^{δ} · p_i. Furthermore, there is a fundamental conceptual difference
between δ-disclosure-privacy and β-likeness: the former
always disallows q_i values equal to 0, and can allow q_i values arbitrarily
close to 1 (as its upper bound can assume values larger than
1), while the latter allows any q_i value less than p_i, but always disallows
q_i values equal to 1 for p_i < 1 (its upper bound being strictly less than
1). We argue that both choices made by β-likeness are more reasonable than those
made by δ-disclosure-privacy. Moreover, we reiterate that the introduction
of δ-disclosure-privacy in [3] was not accompanied by
an anonymization algorithm tailored therefor; the model was only
used as a tool to argue for a negative result, namely that existing
k-anonymization algorithms [18], adapted to δ-disclosure-privacy,
yield unacceptably high information loss [3]. In contrast, our work
aims at a positive result.

4. GENERALIZATION-BASED SCHEME 

In this section we first introduce the metrics we use to measure the
information loss incurred by generalization. Then we present an observation
that motivates our algorithm. After that, we design our
generalization-based algorithm customized for β-likeness.



Nervous and circulatory diseases
  Nervous diseases: Headache, Epilepsy, Brain tumors
  Circulatory diseases: Anemia, Angina, Heart murmur

Figure 1: Domain hierarchy for diseases

4.1 Information Loss Metrics 

To solve the problem posed by the β-likeness model, we need to
fulfill the β constraint while giving up little information. We use an
information loss metric to assess the amount of information ceded 
for the sake of privacy. Different utility objectives would require 
different metrics. When the purpose the data is to be used for is not 
known in advance, a general metric can be used, as in [12]. 

Assume a set of QI attributes QI = {A_1, ..., A_d} and an EC
G. Given a numerical attribute NA ∈ QI, let [L_NA, U_NA] be its
domain and [l_NA, u_NA] the (generalized) range of its values in G;
then the information loss (IL) regarding NA in G is:

$$
IL_{NA}(G) = \frac{u_{NA} - l_{NA}}{U_{NA} - L_{NA}} \qquad (2)
$$



Given a categorical attribute CA, we assume a generalization
hierarchy H_CA on its domain (Fig. 1). Let a be the lowest common
ancestor of all CA values in G; then, the IL regarding CA in G is:

$$
IL_{CA}(G) =
\begin{cases}
0, & |leaves(a)| = 1 \\
\dfrac{|leaves(a)|}{|leaves(H_{CA})|}, & \text{otherwise}
\end{cases}
\qquad (3)
$$



where leaves(a) is the set of leaves under a, and leaves(H_CA)
the set of all leaves in H_CA. Then the total IL of G is:

$$
IL(G) = \sum_{i=1}^{d} w_i \cdot IL_{A_i}(G) \qquad (4)
$$

where w_i is a weight for A_i, with \sum_{i=1}^{d} w_i = 1. In our experiments
we set w_i = 1/d. The Average Information Loss on a table
DB, published as a collection of ECs S_G, is:

$$
AIL(S_G) = \frac{\sum_{G \in S_G} |G| \cdot IL(G)}{|DB|} \qquad (5)
$$

We aim to attain β-likeness on DB at a low value of AIL(S_G).
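Equations (2) to (5) can be sketched in Python as below. The function names, the dictionary encoding of the hierarchy, and the abbreviated node names are our assumptions; the disease hierarchy of Figure 1 is used only to exercise the categorical case.

```python
def il_numerical(values, domain_lo, domain_hi):
    """Eq. (2): width of the generalized range over the width of the domain."""
    return (max(values) - min(values)) / (domain_hi - domain_lo)


def leaves(hierarchy, node):
    """Set of leaf values under `node`; `hierarchy` maps a node to its children."""
    children = hierarchy.get(node, [])
    if not children:
        return {node}
    return set().union(*(leaves(hierarchy, c) for c in children))


def il_categorical(values, hierarchy, root):
    """Eq. (3): leaves under the lowest common ancestor over all leaves."""
    node, target = root, set(values)
    while True:
        # Descend while a single child still covers every value in the EC.
        covering = [c for c in hierarchy.get(node, [])
                    if target <= leaves(hierarchy, c)]
        if not covering:
            break
        node = covering[0]
    n = len(leaves(hierarchy, node))
    return 0.0 if n == 1 else n / len(leaves(hierarchy, root))


def il_ec(attribute_losses, weights=None):
    """Eq. (4): weighted sum of per-attribute losses (equal weights by default)."""
    d = len(attribute_losses)
    weights = weights if weights is not None else [1.0 / d] * d
    return sum(w * loss for w, loss in zip(weights, attribute_losses))


def ail(ecs, table_size):
    """Eq. (5): `ecs` is a list of (EC size, EC information loss) pairs."""
    return sum(size * loss for size, loss in ecs) / table_size


# Hierarchy of Figure 1 (node names abbreviated for the sketch).
DISEASES = {
    "any": ["nervous", "circulatory"],
    "nervous": ["headache", "epilepsy", "brain tumors"],
    "circulatory": ["anemia", "angina", "heart murmur"],
}
```

For instance, an EC whose weights in Table 1 span 50 to 70 within the domain [50, 80] has a numerical loss of 2/3 on that attribute, and generalizing headache and epilepsy to their common ancestor costs 3/6 = 0.5.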

4.2 An Observation 

The intuition behind our generalization-based method emanates
from the following observation. Assume DB is partitioned into a
set of buckets by a 'group-by' on SA. If we form ECs by selecting
from each bucket a number of tuples proportional to its size, then
the SA distribution in the formed EC will be the same as the global
distribution. On the other hand, if we partition DB into buckets
allowing (all tuples of) more than one SA value per bucket, and
then form ECs in a similar fashion, then there will be some variation
in SA distributions among ECs. We aim to configure this
process so as to allow for such variation to the extent permitted by
the β constraint. A similar methodology is followed in SABRE [4],
an algorithm for the t-closeness model. Unfortunately, SABRE
cannot be applied to other distribution-based models, as it caters to
the particular requirements of t-closeness, looking at the semantic
distance between SA values in order to bound the EMD-difference
of distributions between each EC and the overall table. In contrast,
our algorithm should bound the variation in each SA value's frequency.
The following two definitions clarify our intuition.

Definition 4. Given a table DB with sensitive attribute SA,
a set of buckets φ forms an exact bucket partition of DB iff
∪_{B ∈ φ} B = DB and each SA value (tuple) appears in exactly
one bucket.

Definition 5 (proportionality condition). Let φ be a
bucket partition of DB. Assume that an EC, G, is formed with x_j tuples
from bucket B_j ∈ φ, j = 1, 2, ..., |φ|. G abides by the proportionality
condition with respect to φ iff the values x_j are proportional
to |B_j|, i.e., x_1 : x_2 : ... : x_|φ| = |B_1| : |B_2| : ... : |B_|φ||.

Figure 2: Improved information quality

Example 1. Consider Table 1, where {weight, age} is the
QI, and disease is the SA. The diagram in Figure 2 shows the
QI-space and the distribution of tuples, with each QI attribute corresponding
to a dimension. A bucket partition φ of this table could
consist of six buckets of one tuple each, with SA values headache,
epilepsy, brain tumors, anemia, angina, and heart murmur, respectively.
Taking one tuple from each of those, we could build a single
EC satisfying 0-likeness. Still, such an EC covers the entire QI-space,
incurring high information loss. An alternative bucket partition
could consist of three two-tuple buckets, φ = {B_1, B_2, B_3},
with headache and epilepsy in bucket B_1, brain tumors and anemia
in B_2, and the rest in B_3. We can then build two ECs, taking
one tuple from each bucket, as shown in Figure 2. Tuples in the
same EC are labeled by the same number in the figure. This partitioning
achieves better information quality, as the areas of ECs in
QI-space are smaller.

While the bucket partition in the above example enables higher
information quality, it no longer abides by 0-likeness. Still, it satisfies
β-likeness, for β ≥ 1, with respect to Table 1. In general,
it suffices to create ECs so that they attain β-likeness for a given
β > 0. We propose an algorithm that does so in two phases: it first
partitions tuples into buckets, and then determines the number of
tuples each EC needs to draw from each bucket.

4.3 Bucketization Phase

Let the SA domain be V = {v_1, v_2, ..., v_m} and the overall distribution
of SA values P = (p_1, p_2, ..., p_m). We partition V into
subsets, and use them to divide DB into a bucket partition φ; all tuples
in DB with SA values in the same subset of V are pushed to a
single bucket of φ. Assume EC G draws x_j tuples from bucket
B_j ∈ φ, j = 1, 2, ..., |φ|, and let V_j be the subset of SA values
in B_j. Then, in the worst case, all x_j tuples may have the
least frequent SA value in V_j, v_{ℓ_j}, with p_{ℓ_j} = min_{v_i ∈ V_j} {p_i},
hence the frequency of v_{ℓ_j} in G will be q_{ℓ_j} = x_j / |G|. β-likeness
should hold in this case too, i.e., it should be x_j / |G| ≤ f(p_{ℓ_j}) =
(1 + min{β, -ln p_{ℓ_j}}) · p_{ℓ_j}, as the following theorem defines.

Theorem 1 (Eligibility Condition). Let φ be a bucket
partition of DB with sensitive attribute SA, G an EC formed with
x_j tuples from bucket B_j ∈ φ, V_j the set of SA values in B_j, and
p_{ℓ_j} = min_{v_i ∈ V_j} {p_i}, j = 1, 2, ..., |φ|. If ∀j ∈ {1, 2, ..., |φ|},
x_j / |G| ≤ f(p_{ℓ_j}), then G follows β-likeness.



PROOF. For any SA value v_k ∈ V, let B_j ∈ φ be the single
bucket that contains tuples in DB with v_k as their SA value, hence
v_k ∈ V_j. Since G draws x_j tuples from B_j, the frequency of v_k
in G is q_k ≤ x_j / |G| ≤ f(p_{ℓ_j}) ≤ f(p_k), where the last step holds
because p_{ℓ_j} ≤ p_k and f is monotonically increasing. Expanding to
all v_k ∈ V, we conclude that G follows β-likeness. □
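The eligibility condition can be checked mechanically. The Python sketch below applies it to the three-bucket partition of Example 1, where every disease in Table 1 has overall frequency 1/6; the function names and the dictionary encoding are ours. Note that the condition is sufficient, not necessary: an EC failing it may still happen to satisfy β-likeness.

```python
import math


def f_bound(p, beta):
    """Upper bound f(p) = (1 + min(beta, -ln p)) * p of enhanced beta-likeness."""
    return (1.0 + min(beta, -math.log(p))) * p


def is_eligible(draws, min_freq, beta, eps=1e-12):
    """Theorem 1 test for one EC.

    draws[j]    : number of tuples x_j the EC takes from bucket j
    min_freq[j] : frequency of the least frequent SA value in bucket j
    eps         : tolerance absorbing floating-point rounding
    """
    size = sum(draws.values())
    return all(x / size <= f_bound(min_freq[j], beta) + eps
               for j, x in draws.items())


# Example 1: three two-value buckets over Table 1.
MIN_FREQ = {"B1": 1 / 6, "B2": 1 / 6, "B3": 1 / 6}
ONE_FROM_EACH = {"B1": 1, "B2": 1, "B3": 1}

print(is_eligible(ONE_FROM_EACH, MIN_FREQ, beta=1.0))
print(is_eligible(ONE_FROM_EACH, MIN_FREQ, beta=0.5))
```

Drawing one tuple per bucket gives x_j / |G| = 1/3, which equals f(1/6) when β = 1 and exceeds it for smaller β, in line with the observation that the partition of Example 1 requires β ≥ 1.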



Theorem 1 defines the eligibility condition for an EC to follow
β-likeness. However, it does not provide a way to specify a particular
number of tuples x_j to choose from a given bucket B_j, i.e.,
it offers no guidance on how to construct a β-likeness-complying
anonymization. To overcome this lack of guidance, we assume that
ECs are formed following the proportionality condition. Under this
assumption, it holds that x_j / |G| = |B_j| / |DB| = Σ_{v_i ∈ V_j} p_i, and the next
lemma can be easily deduced from Theorem 1.



Lemma 2. Let G be an EC that follows the proportionality
condition with respect to a bucket partition φ of DB with sensitive
attribute SA, V_j the set of SA values in bucket B_j ∈ φ, and
p_{ℓ_j} = min_{v_i ∈ V_j} {p_i}, j = 1, 2, ..., |φ|. If ∀j ∈ {1, 2, ..., |φ|},
Σ_{v_i ∈ V_j} p_i ≤ f(p_{ℓ_j}), then G follows β-likeness.

Lemma 2 defines the condition that the frequencies of a subset of 
SA values Vj C V should obey, so that, if all values in Vj are put in 
the same bucket Bj by a bucket partition tp, then ECs obeying the 
proportionality condition with respect to tp satisfy /3-likeness. This 
condition is trivially satisfied by a strict partition having a single 
SA value per bucket. We aim at a looser bucket partition that sat- 
isfies the condition of Lemma 2 in a non-trivial manner, with more 
than one distinct SA values per bucket (as in Example 1). 



Function DPpartition($\mathcal{DB}$, SA)

1  Let $\mathcal{V} = \{v_1, v_2, \ldots, v_m\}$, $\mathcal{P} = (p_1, p_2, \ldots, p_m)$;
2  Assume that $p_n \le p_{n+1}$, where $n = 1, 2, \ldots, m - 1$;
3  N[0] = 0;
4  S[0] = 0;
5  for e = 1 to m do
6      N[e] = N[e - 1] + 1;
7      S[e] = e;
8      b = e - 1;
9      while b > 0 and Combinable(b, e) = true do
10         if N[b - 1] + 1 < N[e] then
11             N[e] = N[b - 1] + 1;
12             S[e] = b;
13         b = b - 1;
14 Initialize $\varphi$ to be empty;
15 e = m;
16 while e > 0 do
17     b = S[e];
18     Create bucket B, having tuples with SA values in $\{v_b, v_{b+1}, \ldots, v_e\}$;
19     $\varphi = \varphi \cup \{B\}$;
20     e = S[e] - 1;
21 Return $\varphi$;




We develop a bucketization scheme for this task. We start out
by representing $\mathcal{P}$, the set of SA frequencies in $\mathcal{DB}$, in
ascending order, $p_i \le p_{i+1}$, $i = 1, \ldots, m - 1$. By Lemma 2, a set
of consecutive SA values in $\mathcal{V}$, $v_b, v_{b+1}, \ldots, v_e$, are allowed to
be in the same bucket provided that $\sum_{i=b}^{e} p_i \le f(p_\ell)$, where
$p_\ell = \min\{p_b, p_{b+1}, \ldots, p_e\}$. Our scheme, presented in Function
DPpartition, partitions $\mathcal{V}$ by dynamic programming, so as to
minimize the number of buckets. Let $N[e]$ denote the minimum number
of buckets to which we can partition the prefix of $e$ elements in $\mathcal{V}$,
i.e., $v_1, v_2, \ldots, v_e$. The value of $N[e]$ is calculated recursively as:

$$N[e] = \min_{\{b \,\mid\, \mathrm{Combinable}(b, e) = \mathrm{true}\}} \{N[b-1]\} + 1 \qquad (6)$$

Function Combinable($b$, $e$) checks whether SA values $v_b, \ldots, v_e$
can make a bucket, i.e., whether $\sum_{i=b}^{e} p_i \le f(p_\ell)$, where
$p_\ell = \min\{p_b, p_{b+1}, \ldots, p_e\}$. The DP base is $N[0] = 0$.

DPpartition has two parts. The first part (steps 3-13) runs the DP
recursion of Equation 6 to evaluate the final minimum value $N[m]$
and split $\mathcal{V}$ into segments accordingly; thereby, it needs to assess
the combinability of $m^2$ potential buckets. To assess combinability,
we maintain the running $\sum p_i$ within a bucket, updated in $O(1)$ at
each step; the $\min\{p_i\}$ within a bucket is its first element. The
complexity of this part is $O(m^2)$. The second part (steps 14-20)
uses the first-part results to build the bucket partition. Tuples with
the SA values in the same segment make a bucket (step 18), in
$O(|\mathcal{DB}|)$. The overall time complexity is $O(m^2 + |\mathcal{DB}|)$.
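
The recursion of Equation 6 and the backtracking of steps 14-21 can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' Java prototype: the name `dp_partition`, the 0-based index lists it returns, and the choice to operate on frequencies alone (rather than on tuples) are all hypothetical. The running sum is maintained as the text describes, and the minimum frequency of a candidate bucket is its first element in the ascending order.

```python
import math

def f(p, beta):
    """Bound (1 + min{beta, -ln p}) * p from the beta-likeness definition."""
    return (1.0 + min(beta, -math.log(p))) * p

def dp_partition(freqs, beta):
    """Split the ascending-sorted SA frequencies into the fewest segments such
    that each segment's total frequency is at most f(its smallest frequency).
    Returns segments as lists of 0-based indices into sorted(freqs)."""
    p = sorted(freqs)
    m = len(p)
    N = [0] * (m + 1)  # N[e]: minimum number of buckets for the first e values
    S = [0] * (m + 1)  # S[e]: 1-based start of the last bucket in that optimum
    for e in range(1, m + 1):
        N[e] = N[e - 1] + 1
        S[e] = e
        running = p[e - 1]           # running sum of p_b..p_e
        b = e - 1
        while b > 0:
            running += p[b - 1]
            if running > f(p[b - 1], beta):  # Combinable(b, e) is false
                break
            if N[b - 1] + 1 < N[e]:
                N[e] = N[b - 1] + 1
                S[e] = b
            b -= 1
    segments = []
    e = m
    while e > 0:                      # backtrack through S to recover buckets
        b = S[e]
        segments.append(list(range(b - 1, e)))
        e = b - 1
    segments.reverse()
    return segments
```

On the six frequencies of Example 2 with $\beta = 2$, the sketch groups the values pairwise, which agrees with the three buckets reported there.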

4.4 Reallocation Phase 

The bucketization phase of our scheme delivers a bucket partition
$\varphi$ of $\mathcal{DB}$. We have so far assumed that ECs are formed from
$\varphi$ following the proportionality condition. However, a strict
adherence to this condition may result in large ECs, incurring high
information loss. For example, if the size of some bucket $B_j \in \varphi$ is
a prime number (other than 2), then, to strictly follow the
proportionality condition, we should form an EC out of the whole table.
We should rather relax the condition: it should suffice that the number
of tuples $x_j$ chosen from bucket $B_j$ in EC $Q$ be approximately
proportional to the size of $B_j$, i.e.,
$\frac{x_j}{|Q|} \approx \frac{|B_j|}{|\mathcal{DB}|} = \sum_{v_i \in V_j} p_i$. The
rationale for this relaxation is as follows: the bucket partition $\varphi$
returned by DPpartition obeys the inequality
$\sum_{v_i \in V_j} p_i \le f(p_{\ell_j})$ (Lemma 2). Then, if
$\frac{x_j}{|Q|} \approx \sum_{v_i \in V_j} p_i$ (i.e., if we draw tuples into
ECs approximately proportionally to the size of the bucket they hail
from), then the eligibility condition $\frac{x_j}{|Q|} \le f(p_{\ell_j})$
(Theorem 1), and therefore $\beta$-likeness, will be still easily achievable.

To ensure $\beta$-likeness, we determine the EC sizes by constructing
a binary tree, the ECTree, in a top-down fashion. We start
with a bucket partition $\varphi = \{B_1, \ldots, B_{|\varphi|}\}$. The root of the tree
$r$ represents a potential EC that contains all tuples in $\mathcal{DB}$, i.e.,
$|B_j|$ tuples from bucket $B_j$. We denote these contents as
$r = [|B_1|, \ldots, |B_{|\varphi|}|]$. This can be a valid EC, but we prefer smaller
ones. Thus, we proceed to split $r$ into two children (each
representing an EC), dividing each $B_j$ into $B_j^l$ and $B_j^r$. The root's
left child $c_l$ contains $B_j^l$ and the right child $c_r$ contains $B_j^r$,
$j = 1, 2, \ldots, |\varphi|$. To ensure that $B_j^l$ and $B_j^r$ have approximately the
same size, we set $|B_j^l| = \mathrm{round}\left(\frac{|B_j|}{2}\right)$ and
$|B_j^r| = |B_j| - |B_j^l|$.
The split is allowed only if both $c_l$ and $c_r$ satisfy the eligibility
condition (Theorem 1), hence can form ECs satisfying $\beta$-likeness.
Assume the left child of $r$ is $c_l = [|B_1^l|, \ldots, |B_{|\varphi|}^l|]$. Then, for the
eligibility condition to be satisfied, it should hold that
$\frac{|B_j^l|}{\sum_{k=1}^{|\varphi|} |B_k^l|} \le f(p_{\ell_j})$ for each $j$.
An analogous condition applies for $c_r$. If splitting $r$ into
$c_l$ and $c_r$ is allowed, we proceed to check whether we can split
$c_l$ and $c_r$ themselves. When no node can be split further, we
get a final ECTree, in which each leaf node configures the number
of tuples an EC should get from each bucket. A simple function,
biSplit($\varphi$), returns the list of leaf nodes. Example 2 illustrates this
process.

EXAMPLE 2. Let disease be a categorical SA with the domain
hierarchy of Figure 1. Consider a table containing 2 tuples with
headache, 3 with epilepsy, 3 with brain tumors, 3 with anemia,
4 with angina, and 4 with heart murmur. Assume $\beta = 2$.
The overall SA distribution is
$\mathcal{P} = (p_1, p_2, p_3, p_4, p_5, p_6) = \left(\frac{2}{19}, \frac{3}{19}, \frac{3}{19}, \frac{3}{19}, \frac{4}{19}, \frac{4}{19}\right)$.
$f(p_1) \approx 0.31$, $f(p_2) = f(p_3) = f(p_4) \approx 0.45$, and
$f(p_5) = f(p_6) \approx 0.54$. The bucketization phase returns
a bucket partition of the table, $\varphi = \{B_1, B_2, B_3\}$, where $B_1$
accommodates tuples with SA values headache and epilepsy, $B_2$
brain tumors and anemia, and $B_3$ the remaining two. The root
node $r = [5, 6, 8]$ in Figure 3 represents an EC with 5 tuples from
$B_1$, 6 from $B_2$, and 8 from $B_3$ (i.e., all tuples in the table). We
split $r$ into $c_1 = [2, 3, 4]$ and $c_2 = [3, 3, 4]$. Then EC $c_1$ has size
9, and contains 2 tuples from $B_1$ with $\frac{2}{9} \le \min\{f(p_1), f(p_2)\}$,
3 from $B_2$ with $\frac{3}{9} \le \min\{f(p_3), f(p_4)\}$, and 4 from $B_3$ with
$\frac{4}{9} \le \min\{f(p_5), f(p_6)\}$. Thus, $c_1$ obeys the eligibility condition
(Theorem 1). Likewise, $c_2$ also satisfies the condition. Thus,
splitting $r$ into $c_1$ and $c_2$ is allowed. Recursively, we can split $c_1$ into
$[1, 1, 2]$ and $[1, 2, 2]$. When we try to split $c_2$ into $Q_1 = [1, 1, 2]$
and $Q_2 = [2, 2, 2]$, we find that $Q_2$ does not satisfy the eligibility
condition, as $\frac{2}{6} > \min\{f(p_1), f(p_2)\}$, hence this split is not
allowed. Figure 3 shows the final tree, with each leaf node indicating
the number of tuples an EC should draw from each bucket. In the
general case, an EC could also draw no tuples from some bucket.

                 [5, 6, 8]
                /         \
        [2, 3, 4]         [3, 3, 4]
        /       \
[1, 1, 2]       [1, 2, 2]

Figure 3: Dynamic EC size determination
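
The top-down construction of the ECTree can be sketched as a short recursion. This is a minimal sketch, not the authors' biSplit: the name `bi_split`, its inputs, and in particular the use of integer halving for the left child are assumptions. Halving rounds down, which is the rule that reproduces the splits shown in Example 2; the text itself only says the halves should be approximately equal.

```python
import math

def f(p, beta):
    """Bound (1 + min{beta, -ln p}) * p from the beta-likeness definition."""
    return (1.0 + min(beta, -math.log(p))) * p

def eligible(node, bounds):
    """Eligibility condition: each bucket's share of the EC is within its bound."""
    size = sum(node)
    return size > 0 and all(x / size <= b for x, b in zip(node, bounds))

def bi_split(bucket_sizes, min_freqs, beta):
    """Return the leaves of the ECTree as lists of per-bucket tuple counts.

    bucket_sizes[j] -- |B_j|; min_freqs[j] -- least SA frequency in B_j.
    """
    bounds = [f(p, beta) for p in min_freqs]
    leaves = []

    def split(node):
        left = [x // 2 for x in node]                 # assumed rounding rule
        right = [x - l for x, l in zip(node, left)]
        if eligible(left, bounds) and eligible(right, bounds):
            split(left)
            split(right)
        else:
            leaves.append(node)                       # node cannot be split further

    split(list(bucket_sizes))
    return leaves
```

Calling `bi_split([5, 6, 8], [2/19, 3/19, 4/19], 2)` yields the three leaves of Figure 3, namely `[1, 1, 2]`, `[1, 2, 2]`, and `[3, 3, 4]`.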

4.5 BUREL 

We now put the above phases together to devise BUREL, an
algorithm that BUcketizes tuples into buckets and REallocates them
from buckets to ECs so as to attain $\beta$-Likeness. The distinctive and
novel feature of this algorithm, as opposed to algorithms for
$k$-anonymity, $\ell$-diversity, and $t$-closeness, is that it distinguishes among
SA values by their frequencies and builds its operation and
reasoning around this frequency-based partitioning.

The bucketization phase of BUREL returns $\varphi$, a bucket partition
of $\mathcal{DB}$ (step 2). Then, its reallocation phase (function biSplit)
determines the number of tuples each EC should draw from each
bucket at a leaf of the ECTree and returns a list of arrays $S_a$
containing these size values (step 3). Specific ECs following the
prescribed sizes are then materialized (steps 4-9). Given an array
$a \in S_a$, BUREL retrieves $a_j$ tuples from $B_j \in \varphi$, where $a_j$ is
the $j$-th element of $a$ and $j = 1, 2, \ldots, |\varphi|$, and forms an EC $Q$ out of
them (steps 6-8).

Algorithm: BUREL($\mathcal{DB}$, SA, $\beta$)

1 Let $\{v_1, v_2, \ldots, v_m\}$ be all the SA values in $\mathcal{DB}$, and
  $\{p_1, p_2, \ldots, p_m\}$ be their distributions;
2 $\varphi$ = DPpartition($\mathcal{DB}$, SA);
3 $S_a$ = biSplit($\varphi$);
4 foreach array $a$ in $S_a$ do
5     Create an empty EC, say $Q$;
6     foreach $a_j$, $j$-th element of $a$ do
7         $ec_j$ = Retrieve($B_j$, $a_j$);
8         add $ec_j$ to $Q$;
9     Output $Q$;

When retrieving tuples from buckets, BUREL is indifferent to
their SA values. The $\beta$-likeness between a constructed EC $Q$ and
the whole table $\mathcal{DB}$ is guaranteed by Theorem 1; tuple selection is
guided by information loss considerations, as prescribed by our
information loss metric (Section 4.1). This metric requires the
Minimum Bounding Boxes of ECs to be small. Accordingly, we employ
function Retrieve($B_j$, $a_j$) (step 7), which greedily picks tuples of
similar QI values. This greedy selection of nearby tuples works as
follows: We define a multidimensional space with each QI attribute
as a dimension. The mapping to such a QI-space for a numerical
QI attribute NA is straightforward. The axis of a categorical
QI attribute CA is formed by the order provided by a pre-order
traversal of the leaves in its domain hierarchy $H_{CA}$. Each tuple is
represented as a point in this QI-space. When forming an EC $Q$,
BUREL first randomly picks a tuple $x$ from a bucket of $\varphi$, and
then finds the $a_j$ nearest neighbors (by Euclidean distance) of $x$ in
each bucket $B_j$, $j = 1, 2, \ldots, |\varphi|$, and adds them into $Q$, until the
size specifications are satisfied. Still, this process can be
computationally demanding even with an index structure [8]. Thus, we
devise an efficient heuristic using the Hilbert curve [24], a continuous
fractal that maps regions of QI-space, hence tuples, to 1D Hilbert
values. Tuples close in QI-space are likely to have nearby Hilbert
values. BUREL sorts tuples in $B_j$ by their Hilbert values, and uses
this order to select the $a_j$ nearest neighbors of a tuple $x$ within each
bucket. We find the nearest Hilbert-neighbor $x'$ of $x$ within bucket
$B_j$ by binary search, and then expand to the next closest $a_j$
neighbors to $x$. The average time complexity for this search operation is
$O\left(|\varphi| \cdot \left(\log\frac{|\mathcal{DB}|}{|\varphi|} + \frac{|Q|}{|\varphi|}\right)\right)$,
where $|\varphi|$ is the number of buckets, $\frac{|\mathcal{DB}|}{|\varphi|}$
the average size of a bucket, and $\frac{|Q|}{|\varphi|}$ the average number
of tuples drawn from a bucket to form an EC.
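
The binary-search-then-expand step within one bucket can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' Retrieve: it presumes each tuple has already been mapped to a one-dimensional key (for instance a Hilbert value computed elsewhere), that the bucket is given as a sorted list of such keys, and that closeness is measured by key difference. The name `retrieve` and its signature are hypothetical, and the sketch does not remove the selected tuples from the bucket.

```python
import bisect

def retrieve(sorted_keys, key_x, a):
    """Return the indices of the `a` entries of `sorted_keys` whose 1D keys
    are closest to `key_x`, found by binary search and two-sided expansion."""
    n = len(sorted_keys)
    a = min(a, n)                                  # cannot pick more than the bucket holds
    hi = bisect.bisect_left(sorted_keys, key_x)    # first key >= key_x
    lo = hi - 1                                    # last key < key_x
    picked = []
    while len(picked) < a:
        # Take from the left when the right side is exhausted, or when the
        # left candidate is at least as close as the right one.
        take_left = hi >= n or (
            lo >= 0 and key_x - sorted_keys[lo] <= sorted_keys[hi] - key_x
        )
        if take_left:
            picked.append(lo)
            lo -= 1
        else:
            picked.append(hi)
            hi += 1
    return sorted(picked)
```

The binary search costs logarithmic time in the bucket size and the expansion is linear in `a`, which mirrors the two terms of the complexity expression above.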

5. PERTURBATION-BASED SCHEME 

Our generalization-based solution achieves $\beta$-likeness and also
provides identity anonymity, like all generalization-based methods
do. However, in case a data set contains a few remote outliers,
these outliers may force a highly unsatisfactory solution by
generalization. Similarly unsatisfactory solutions can be obtained in case
of extremely infrequent SA values. For example, consider a dataset
$\mathcal{DB}$, in which only one tuple $t$ has SA value $v$. Then, to attain
1-likeness, we would have to create an EC containing $t$ and at least
half of the tuples in $\mathcal{DB}$. We deduce that an alternative solution
is desirable in order to handle such irregular cases, even at the
expense of identity anonymity. To that end, we propose an approach
that anonymizes each tuple independently by perturbing SA values
while preserving QI values intact. We reiterate that, for a given
SA value $v_i$, $\beta$-likeness considers its frequency in the whole table,
$p_i$, as prior confidence, and constrains an adversary's information
gain on $v_i$ after seeing the published data, bounding the posterior
confidence. We aim to achieve this target by perturbing SA values
only; our scheme resembles a randomized response procedure,
albeit having a different perturbation probability for each SA value.

Definition 6 ($\beta$-likeness by perturbation). Given a table
$\mathcal{DB}$ with sensitive attribute SA, let $\mathcal{V} = \{v_1, \ldots, v_m\}$ be the
SA domain, and $\mathcal{P} = (p_1, \ldots, p_m)$ the overall SA distribution in
$\mathcal{DB}$. A perturbation on $\mathcal{DB}$ that randomizes SA values satisfies
$\beta$-likeness, iff the adversarial posterior confidence in $v_i \in \mathcal{V}$ after
seeing the randomized data is at most $f(p_i)$, $i = 1, 2, \ldots, m$.

To build a solution that achieves $\beta$-likeness by perturbation, we
adapt the concept of upward $(\rho_1, \rho_2)$-privacy [10] as follows.

DEFINITION 7 ($(\rho_{1i}, \rho_{2i})$-PRIVACY). Let $v_i \in \mathcal{V}$ be an
original SA value, and $v \in \mathcal{V}$ be any SA value after perturbation. We
say that $(\rho_{1i}, \rho_{2i})$-privacy is satisfied on $v_i$, iff the adversarial prior
confidence in $v_i$ is $C(U = v_i) = \rho_{1i}$, and the posterior confidence
after seeing $v$ is $C(U = v_i \mid V = v) \le \rho_{2i}$.

While $(\rho_1, \rho_2)$-privacy does not distinguish among SA values,
our adaptation does. Given these definitions, we can achieve
$\beta$-likeness by setting $\rho_{1i} = p_i$ and $\rho_{2i} = f(p_i)$ for each $v_i \in \mathcal{V}$.

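THEOREM 2. Let $v_i \in \mathcal{V}$ be an original SA value, and $v \in \mathcal{V}$ be
an SA value after perturbation, such that $\exists u \in \mathcal{V}: \Pr(u \to v) > 0$,
where $u \to v$ denotes that $u$ has been perturbed to $v$. If it holds that

$$\forall v_j \in \mathcal{V}: \quad \frac{\Pr(v_i \to v)}{\Pr(v_j \to v)} \le \frac{\rho_{2i}}{\rho_{1i}} \cdot \frac{1 - \rho_{1i}}{1 - \rho_{2i}} = \gamma_i \qquad (7)$$

then $(\rho_{1i}, \rho_{2i})$-privacy is satisfied with $\rho_{1i} = C(U = v_i) > 0$.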



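PROOF. Assume that $(\rho_{1i}, \rho_{2i})$-privacy is not satisfied, that is,
$C(U = v_i \mid V = v) > \rho_{2i}$. For the event of seeing $v$ it holds that
$C(V = v) = \sum_{u \in \mathcal{V}} C(U = u) \cdot \Pr(u \to v) > 0$, as $v$ must have
been produced by some original value $u$. Let $v_j$ be an SA value
least likely to have been perturbed to $v$, i.e.:

$$v_j \in \{u \in \mathcal{V} \mid \Pr(u \to v) = \min_{u' \in \mathcal{V}} \Pr(u' \to v)\}$$

By the definition of conditional probability it holds that:

$$C(U = v_i \mid V = v) = \frac{C(U = v_i) \cdot \Pr(v_i \to v)}{C(V = v)} \qquad (8)$$

and, since $v_j$ is least likely to have yielded $v$, it is:

$$C(U \ne v_i \mid V = v) \ge \frac{C(U \ne v_i) \cdot \Pr(v_j \to v)}{C(V = v)} \qquad (9)$$

Since, by our assumption, $C(U = v_i \mid V = v) > \rho_{2i} > 0$, and
$C(U = v_i) = \rho_{1i} > 0$, from Eqs. (8) and (9) we get:

$$\frac{C(U \ne v_i \mid V = v)}{C(U = v_i \mid V = v)} \ge \frac{\Pr(v_j \to v)}{\Pr(v_i \to v)} \cdot \frac{C(U \ne v_i)}{C(U = v_i)} \qquad (10)$$

Inequality (7) holds for $v_j$, thus we can rewrite Inequality (10) as

$$\frac{1 - C(U = v_i \mid V = v)}{C(U = v_i \mid V = v)} \ge \frac{1}{\gamma_i} \cdot \frac{1 - C(U = v_i)}{C(U = v_i)} \qquad (11)$$

Still, $\frac{1 - \rho_{1i}}{\rho_{1i}} \cdot \frac{1}{\gamma_i} = \frac{1 - \rho_{2i}}{\rho_{2i}}$, hence Inequality (11) yields:

$$\frac{1 - C(U = v_i \mid V = v)}{C(U = v_i \mid V = v)} \ge \frac{1 - \rho_{2i}}{\rho_{2i}} \;\Longrightarrow\; C(U = v_i \mid V = v) \le \rho_{2i}$$

which contradicts our assumption. □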

Due to Theorem 2, Inequality (7) provides a sufficient condition
for $\beta$-likeness to hold. We aim to achieve this condition by uniform
perturbation, which maximizes the utility of randomized data [2].
Given an input SA value $v_i \in \mathcal{V}$, uniform perturbation tosses a coin
with probability $\alpha_i \in (0, 1]$ for heads and $1 - \alpha_i$ for tails, and, in the
latter case, replaces $v_i$ by a randomly selected value $v \in \mathcal{V}$. Then:

$$\Pr(v_i \to v) = \begin{cases} \alpha_i + (1 - \alpha_i)/m & \text{if } v_i = v \\ (1 - \alpha_i)/m & \text{if } v_i \ne v \end{cases} \qquad (12)$$

LEMMA 3. Given any perturbed value $v$, $\Pr(v_i \to v)$ is
maximized when $v_i = v$.

PROOF. By Equation 12, if $v_i = v$, then
$\Pr(v_i \to v) = \alpha_i + (1 - \alpha_i)/m$. For $v_j \ne v$,
$\Pr(v_j \to v) = \frac{1 - \alpha_j}{m}$. Since $\alpha_i, \alpha_j \in (0, 1]$, it is
$\Pr(v_i \to v) - \Pr(v_j \to v) = \frac{(m - 1)\alpha_i + \alpha_j}{m} > 0$. □

For the sake of utility, we need to maximize the probability that
input SA values remain unchanged, i.e., to set $\alpha_i$ as high as
possible for each $v_i \in \mathcal{V}$. However, for a given $v_i$, the value of $\alpha_i$
should allow Inequality (7) to hold for $v = v_i$ and for any $v_j \ne v$; if
it holds in these extreme cases, then it also holds for all other values
of variables $v$ and $v_j$. Substituting the values given by Equation 12
for $v = v_i$ and $v_j \ne v$ in Inequality (7), we get
$\frac{\alpha_i + (1 - \alpha_i)/m}{(1 - \alpha_j)/m} \le \gamma_i$,
$\forall j \in \{1, 2, \ldots, m\} \setminus \{i\}$. The worst-case value the denominator
in the last inequality (i.e., the probability $\Pr(v_j \to v)$ of perturbing
$v_j \ne v$ to $v$) can assume is $C_m = \min_{h=1}^{m} \left\{\frac{1 - \alpha_h}{m}\right\}$. To calculate a
bound for $\alpha_i$, we require that the inequality holds in the worst case:

$$\frac{\alpha_i + (1 - \alpha_i)/m}{C_m} \le \gamma_i \qquad (13)$$

$$\alpha_i \le \frac{m \cdot \gamma_i \cdot C_m - 1}{m - 1} \qquad (14)$$

Since by definition $C_m \le \frac{1 - \alpha_i}{m}$, if Inequality (13) holds, then,
for a given $i$, it will hold that $\frac{\alpha_i + (1 - \alpha_i)/m}{(1 - \alpha_i)/m} \le \gamma_i$; consequently, it
will be $\alpha_i \le \frac{\gamma_i - 1}{\gamma_i + m - 1}$. Using this upper bound of $\alpha_i$, we infer that,
for each $i \in \{1, 2, \ldots, m\}$ it will be $\frac{1 - \alpha_i}{m} \ge \frac{1}{\gamma_i + m - 1}$. In effect, it
should also be
$C_m = \min_{h=1}^{m} \left\{\frac{1 - \alpha_h}{m}\right\} \ge \min_{h=1}^{m} \left\{\frac{1}{\gamma_h + m - 1}\right\} = \frac{1}{\gamma_M + m - 1}$,
where $\gamma_M = \max_{h=1}^{m} \{\gamma_h\}$. We have thus derived a lower
bound for $\Pr(v_j \to v)$ with $v_j \ne v$, namely $C_M = \frac{1}{\gamma_M + m - 1}$. To
ensure that Inequality (14) always holds, we must guarantee that it
also holds for the lower-bound case; thus, the highest value we can
safely assign to each $\alpha_i$ is $\alpha_i = \frac{m \cdot \gamma_i \cdot C_M - 1}{m - 1}$. Eventually:

THEOREM 3. Perturbation by Eq. (12) with
$\alpha_i = \frac{m \cdot \gamma_i \cdot C_M - 1}{m - 1}$,
$\gamma_i = \frac{\rho_{2i}}{\rho_{1i}} \cdot \frac{1 - \rho_{1i}}{1 - \rho_{2i}}$,
$\rho_{1i} = p_i$, $\rho_{2i} = f(p_i)$, $\forall v_i \in \mathcal{V}$, satisfies $\beta$-likeness.

PROOF. Due to Lemma 3, for any SA value $v_j$, perturbation by
Eq. (12) gives
$\frac{\Pr(v_i \to v)}{\Pr(v_j \to v)} \le \frac{\alpha_i + (1 - \alpha_i)/m}{C_M} = \gamma_i$.
Then, by Theorem 2, $(\rho_{1i}, \rho_{2i})$-privacy holds $\forall v_i \in \mathcal{V}$, hence
$\beta$-likeness is satisfied. □
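
The parameters of Theorem 3 can be computed directly from the SA distribution. The sketch below is illustrative only; the function name `perturbation_matrix` and its return layout are assumptions. It also assumes that every $f(p_i) < 1$ and that the resulting $\alpha_i$ values fall in $(0, 1]$, which the formula does not enforce by itself and which the sketch does not check. The matrix entry `pm[i][j]` holds $\Pr(v_j \to v_i)$ as given by Equation 12.

```python
import math

def f(p, beta):
    """Bound (1 + min{beta, -ln p}) * p from the beta-likeness definition."""
    return (1.0 + min(beta, -math.log(p))) * p

def perturbation_matrix(freqs, beta):
    """Compute gamma_i, alpha_i and the transition probabilities of Eq. (12).

    Returns (gammas, alphas, pm) with pm[i][j] = Pr(v_j -> v_i).
    Assumes f(p) < 1 for every p in freqs (otherwise gamma is undefined).
    """
    m = len(freqs)
    gammas = []
    for p in freqs:
        rho1, rho2 = p, f(p, beta)          # prior and allowed posterior confidence
        gammas.append((rho2 / rho1) * ((1.0 - rho1) / (1.0 - rho2)))
    c_M = 1.0 / (max(gammas) + m - 1)       # lower bound on Pr(v_j -> v), v_j != v
    alphas = [(m * g * c_M - 1.0) / (m - 1) for g in gammas]
    pm = [
        [
            alphas[j] + (1.0 - alphas[j]) / m if i == j else (1.0 - alphas[j]) / m
            for j in range(m)
        ]
        for i in range(m)
    ]
    return gammas, alphas, pm
```

Each column of `pm` is the output distribution of one original value and therefore sums to 1; the diagonal entry of row $i$ divided by any other entry of that row stays within $\gamma_i$, which is the condition of Inequality (7).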

We now discuss how we reconstruct the original SA distribution
from the perturbed data to answer aggregation queries, which are a
basis of data mining tasks, as the following [33]:

SELECT COUNT(*) FROM Anonymized-data
WHERE pred($A_1$) AND ... AND pred($A_\lambda$)
AND pred(SA)

This query has predicates on $\lambda$ randomly selected QI attributes and
the SA. For each of these $\lambda + 1$ attributes $A$, pred($A$) has the form
of $A \in R_A$, where $R_A$ is an arbitrary interval in the domain of $A$.

Perturbation does not affect QI values. We reconstruct the query's
result by estimating the count of original SA values among those
tuples that satisfy the query's QI predicates, given the observed
SA values. In particular, given an aggregation query, suppose that
$S_t$ is the set of tuples satisfying the predicates associated with QI
attributes, $A_1 \in R_{A_1}$ AND ... AND $A_\lambda \in R_{A_\lambda}$. Let $S'_t$ be the
perturbed form of $S_t$. Since perturbation randomizes only SA values,
each tuple in $S_t$ is perturbed in $S'_t$ with its QI value unchanged. Let
$n_i$ be the number of tuples with SA value $v_i$ in $S_t$, $i = 1, 2, \ldots, m$,
and $e_i = \sum_{j=1}^{m} \Pr(v_j \to v_i) \cdot n_j$ the expected number of instances
of $v_i$ in $S'_t$. According to our previous discussion, if $j = i$, then
$\Pr(v_j \to v_i) = \gamma_i \cdot C_M$, else $\Pr(v_j \to v_i) = \frac{1 - \gamma_j \cdot C_M}{m - 1}$. Using the
notation $X_i = \gamma_i \cdot C_M$ and $Y_j = \frac{1 - \gamma_j \cdot C_M}{m - 1}$, we have $E = PM \times N$,
where $E = \langle e_1, e_2, \ldots, e_m \rangle$, $N = \langle n_1, n_2, \ldots, n_m \rangle$, and

$$PM = \begin{pmatrix} X_1 & Y_2 & \cdots & Y_m \\ Y_1 & X_2 & \cdots & Y_m \\ \vdots & \vdots & \ddots & \vdots \\ Y_1 & Y_2 & \cdots & X_m \end{pmatrix}$$

A data recipient knows neither $E$ nor $N$, but only observes
$E' = \langle e'_1, e'_2, \ldots, e'_m \rangle$, where $e'_i$ is the number of occurrences of SA
value $v_i$ in $S'_t$. Thus, one can approximately reconstruct $N$ as
$N' = PM^{-1} \times E' = \langle n'_1, n'_2, \ldots, n'_m \rangle$, and estimate the answer to
a given query as $est = \sum_{v_i \in R_{SA}} n'_i$, where $R_{SA}$ is the query
interval of pred(SA). To facilitate this reconstruction process, we
publish the perturbed data along with matrix $PM$; we can also
release the original global SA distribution $\mathcal{P}$ in order to render the
publication model comparable to that offered by generalization.
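
The reconstruction $N' = PM^{-1} \times E'$ amounts to solving a linear system. The following sketch is an assumption-laden illustration rather than the authors' code: it solves the system by Gauss-Jordan elimination in plain Python instead of forming the inverse explicitly, presumes $PM$ is non-singular, and uses the hypothetical names `solve` and `estimate_count`. The argument `sa_indices` stands for the indices of the SA values that fall in the query interval $R_{SA}$.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting.
    Assumes the matrix is square and non-singular."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def estimate_count(pm, observed, sa_indices):
    """Estimate a COUNT query result from perturbed data.

    pm[i][j]   -- Pr(v_j -> v_i), the published perturbation matrix
    observed   -- E': observed counts of each SA value among the tuples
                  that satisfy the QI predicates
    sa_indices -- indices of SA values inside the query interval on SA
    """
    n_prime = solve(pm, observed)       # N' = PM^{-1} x E'
    return sum(n_prime[i] for i in sa_indices)
```

Because $E'$ is a random realization rather than the expectation $E$, the reconstructed counts are estimates and may be fractional or slightly negative for small result sets.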

6. EXPERIMENTAL EVALUATION 

In this section we evaluate our schemes. Our prototypes were
implemented in Java and the experiments ran on a Core2 Duo 2.33GHz
CPU machine with 4GB RAM running Windows XP. We use the
CENSUS dataset [1], which contains 500,000 tuples on 6 attributes
as shown in Table 3. For categorical attributes, the value following
the type is the height of the corresponding attribute hierarchy; for
instance, attribute marital status is categorical and has a hierarchy
of height 2. The first 5 attributes are potential QI-attributes; the last
(salary class) is the SA. By default, we take the first three attributes
as QI. The least frequent SA value is 49, with frequency 0.2018%;
the most frequent SA value is 12, with frequency 4.8402%; $\beta = 1$
produces frequency threshold $e^{-\beta} \approx 37\%$, which marks all SA
values as 'infrequent', and allows the frequency of any SA value in
any EC to be at most 4.8402% x 2 = 9.7%. Thus, 1 is a small $\beta$
value. We use $\beta \in \{1, 2, 3, 4, 5\}$. We generate 5 microdata tables
by randomly picking 100K to 500K tuples from the dataset; the
one of 500K tuples is our default dataset.



Attribute         Cardinality   Type
Age               79            numerical
Gender            2             categorical (1)
Education Level   17            numerical
Marital Status    6             categorical (2)
Work Class        10            categorical (3)
Salary Class      50            sensitive attribute

Table 3: The CENSUS dataset

We set the likeness threshold $\beta$ by default to 4. Then, given the
application of enhanced $\beta$-likeness for any SA value $v_i$, if
$p_i \le e^{-4} \approx 0.018$, its frequency $q_i$ in any EC should not exceed $5 p_i$; if
$p_i > 1.8\%$, then it should be $q_i \le (1 - \ln(p_i)) \cdot p_i$. We reiterate
that these bounds apply to each SA value, while their definition
accommodates both low-frequency and high-frequency values. The
highest SA value frequency in our data set does not exceed 5%, so
the frequency of any salary class in any EC will not exceed 20%.
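
The bounds quoted above follow from evaluating $f(p) = (1 + \min\{\beta, -\ln p\}) \cdot p$ at the stated frequencies. The short sketch below, with the hypothetical function name `max_ec_frequency`, reproduces that arithmetic; it is offered as a check on the numbers, not as part of the evaluated prototypes.

```python
import math

def max_ec_frequency(p, beta):
    """Largest frequency an SA value with overall frequency p may have in
    any EC under enhanced beta-likeness: (1 + min{beta, -ln p}) * p."""
    return (1.0 + min(beta, -math.log(p))) * p

# Frequency below which the bound is simply (1 + beta) * p, for beta = 4.
switch_point = math.exp(-4)

# A value at the 5% ceiling is limited to roughly 20% of any EC (beta = 4).
bound_at_five_percent = max_ec_frequency(0.05, 4)

# The least frequent salary class (0.2018%) is limited to 5 times its frequency.
bound_least_frequent = max_ec_frequency(0.002018, 4)

# With beta = 1 the most frequent class (4.8402%) may at most double.
bound_most_frequent_beta1 = max_ec_frequency(0.048402, 1)
```

For frequencies above $e^{-\beta}$ the logarithmic term takes over, so the permitted relative growth shrinks as a value becomes more frequent.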

6.1 Face-to-face with t-closeness 

Our first task is to compare our new $\beta$-likeness privacy model to
the predecessor distribution-based model of $t$-closeness. We argue
that $\beta$-likeness provides a more informative and comprehensible
privacy guarantee than $t$-closeness does. Still, in order to create an
even playing field on which to compare $\beta$-likeness to $t$-closeness,
we conducted three face-to-face comparisons as follows.

In the first comparison, for a given dataset $\mathcal{DB}$ and $\beta$, we let
BUREL transform $\mathcal{DB}$ to $\mathcal{DB}_\beta$, satisfying $\beta$-likeness. We then
measure the closeness $t_\beta$, by the $t$-closeness model, between $\mathcal{DB}_\beta$
and $\mathcal{DB}$, i.e., the maximum EMD of the SA distribution in an EC
of $\mathcal{DB}_\beta$ from its distribution in $\mathcal{DB}$. We then apply $t$-closeness
schemes tMondrian [20] and SABRE [4] on $\mathcal{DB}$ as well, with
$t_\beta$ as the $t$-closeness threshold, to produce $\mathcal{DB}^M_{t_\beta}$ and
$\mathcal{DB}^S_{t_\beta}$, respectively. Then $\mathcal{DB}_\beta$, $\mathcal{DB}^M_{t_\beta}$, and
$\mathcal{DB}^S_{t_\beta}$ achieve the same privacy
under the criterion of $t$-closeness, as expressed by $t_\beta$. Then we
measure the $\beta$ value achieved by $\mathcal{DB}^M_{t_\beta}$ and $\mathcal{DB}^S_{t_\beta}$ with respect
to $\mathcal{DB}$. Given that all three schemes achieve the same privacy in
terms of $t$-closeness, we are interested to compare the privacy they
achieve in terms of $\beta$-likeness. Figure 4(a) shows the results (in
logarithmic y-axes), as a function of the given $\beta$ parameter. While all
the three schemes are tuned to ensure the same $t$-closeness guarantee,
BUREL provides consistently higher privacy by the criterion
of $\beta$-likeness than SABRE and tMondrian. This result is expected,
since $t$-closeness restricts only the cumulative difference between
SA distributions, indifferent to the relative frequency difference of
each individual SA value between an EC and the whole table.

Next, for a given dataset $\mathcal{DB}$ and closeness constraint $t$, we let
tMondrian (SABRE) transform $\mathcal{DB}$ to $\mathcal{DB}^M_t$ ($\mathcal{DB}^S_t$), attaining
$t$-closeness. We then let BUREL find, by binary search, a value $\beta_t$,
such that, when it enforces $\beta_t$-likeness on $\mathcal{DB}$, it produces an
anonymization $\mathcal{DB}_{\beta_t}$ characterized by the same (or smaller) closeness
parameter $t$ as $\mathcal{DB}^M_t$ ($\mathcal{DB}^S_t$). Again we get three anonymized
versions of $\mathcal{DB}$ that achieve the same privacy under $t$-closeness. While
in our first comparison we arrived at this state starting out with a
$\beta$ parameter, now we start out with a $t$ parameter. Thus, we avoid
bias against $t$-closeness schemes. We now compare the $\beta$-likeness
achieved by $\mathcal{DB}^M_t$ ($\mathcal{DB}^S_t$) to that of $\mathcal{DB}_{\beta_t}$, as a function of $t$. The
results, shown in Figure 4(b), reaffirm our previous findings.

In our last experiment, given an AIL value $l$, we let BUREL
determine, by binary search on its $\beta$ threshold, a value $\beta_l$, such
that the data set $\mathcal{DB}_{\beta_l}$ it generates from $\mathcal{DB}$ with $\beta_l$ as the
likeness threshold achieves AIL equal to (or smaller than) $l$. Likewise,
we determine, by binary search, a value $t^M_l$ ($t^S_l$), which, used as
the closeness threshold in tMondrian (SABRE), generates data set
$\mathcal{DB}^M_{t_l}$ ($\mathcal{DB}^S_{t_l}$) with AIL near $l$ too, allowing for a small difference
$\epsilon$. Thus, we obtain three data sets $\mathcal{DB}_{\beta_l}$, $\mathcal{DB}^M_{t_l}$, and $\mathcal{DB}^S_{t_l}$,
generated by BUREL, tMondrian, and SABRE, respectively, which all
have information loss near $l$; to ensure the comparison is not biased
in favor of BUREL, we ensure its AIL value is not greater than
those of the other algorithms. We then compare the privacy they
achieve in terms of $\beta$-likeness. Figure 4(c) shows the results. Not
surprisingly, BUREL provides the highest privacy again, followed
by SABRE and tMondrian.

Our results testify that, other factors being equal, state-of-the-art
$t$-closeness schemes fail by a wide margin (as indicated by the
logarithmic y-axes) to achieve good privacy in terms of $\beta$-likeness.
Thus, they reaffirm that $\beta$-likeness raises substantially different
requirements from $t$-closeness, and requires a different approach.

6.2 Evaluation on Generalization 

In this section we evaluate the performance of BUREL as a
$\beta$-likeness algorithm in its own field. As there is no previous work
on $\beta$-likeness, we employ two comparison benchmarks adopting
some suggestions of related work. First, we devise an algorithm
for $\beta$-likeness, following the conventional wisdom on designing
algorithms for new privacy models: We adapt Mondrian [18], a
$k$-anonymization algorithm, to the purposes of $\beta$-likeness, as
previous works have done for other privacy models [22, 20, 3, 21].
Our adaptation, LMondrian, splits an EC only if both resultant ECs
satisfy $\beta$-likeness. Second, we use the similar adaptation of
Mondrian to $\delta$-disclosure-privacy suggested in [3], DMondrian. To
render DMondrian comparable to BUREL and LMondrian, we set the
value of $\delta$ so that the data anonymized by DMondrian obey
$\beta$-likeness. As we have discussed, while $\beta$-likeness demands that an SA
value's distribution in an EC be $q_i \le (1 + \min\{\beta, -\ln p_i\}) \cdot p_i$, for
a given $\beta$, $\delta$-disclosure-privacy requires that $e^{-\delta} \cdot p_i < q_i < e^{\delta} \cdot p_i$,
where $p_i$ is the overall distribution of $v_i$ in the whole dataset. Thus,
an algorithm for $\delta$-disclosure-privacy achieves $\beta$-likeness for
$\delta \le \log(1 + \min\{\beta, -\ln p_i\})$, for all $p_i$; in view of all SA values in $\mathcal{V}$,
we set $\delta = \log\left(1 + \min\left\{\beta, -\ln\left(\max_i\{p_i\}\right)\right\}\right)$. We first
compare the three schemes with respect to average information loss and
wall-clock time.

Figure 5: Effect of varying $\beta$

First, we study performance as a function of the $\beta$ threshold.
Figure 5 shows the results. As $\beta$ grows, the constraint on the
relative difference of each SA (i.e., salary) value frequency
between an EC and the overall table is relaxed, hence information
quality rises (Figure 5(a)). BUREL outperforms both LMondrian
and DMondrian in information quality, showing the benefit of a
scheme tailored for $\beta$-likeness. This result reconfirms the finding of
[3] that a $k$-anonymization algorithm, adapted to
$\delta$-disclosure-privacy, yields unacceptably high information loss; as we discussed,
we aim at a positive result and propose a better alternative. In
addition, given that $\delta$-disclosure-privacy overprotects data by
imposing a constraint on negative information gain, LMondrian performs
better than its stricter sibling, DMondrian. Remarkably, BUREL
also outpaces both Mondrian-based schemes in efficiency (Figure
5(b)). Overall, BUREL achieves almost half the information loss
of its Mondrian-based competitors in about half the time.

Next, we investigate the effect of QI dimensionality (size), vary- 
ing it from 1 to 5. As QI dimensionality increases, the data become 
more sparse in QI space, as more high-dimensional degrees of free- 
dom are offered; thus, the formed ECs are more likely to have large 
minimum bounding boxes, and information quality degrades, as 
Figure 6(a) shows. The information loss of BUREL is again lower 
than that of the Mondrian-based methods. In addition, BUREL is 
again the fastest of the three (Figure 6(b)). 

Figure 6: Effect of varying QI

Our next experiment studies the effect of database size, varying
the size of the microdata table from 100K to 500K tuples. Figure
7 presents our results. Interestingly, data size has no clear effect
on information quality. This is due to the fact that, as the number
of tuples grows, more sensitive values are revealed, imposing their
own requirements. The mere increase of data density does not help,
as it would with simpler models like $k$-anonymity. Still, the elapsed
time increases as the table size grows; BUREL is again found to be
superior in both respects.

Figure 7: Effect of varying dataset

We now examine the utility of the generalized table by
aggregation queries introduced in Section 5. Each predicate pred($A$) in
the query has the form of $A \in R_A$. Let expected selectivity over
the table be $0 < \theta < 1$. Assuming data are uniformly distributed, $\theta$
can be achieved if each attribute $A$ selects records within a range
of length $|A| \cdot \theta_A$ of its domain, such that $(\theta_A)^{\lambda + 1} = \theta$. In effect, the
length of $R_A$ should be $|A| \cdot \theta^{\frac{1}{\lambda + 1}}$, where $|A|$ is the domain length
of attribute $A$. Given a query, the precise result $prec$ is computed
from the original table, and an estimated result $est$ is obtained from
the anonymized table. To calculate $est$, we assume that tuples in
each EC are uniformly distributed, and consider the intersection
between the query and the EC. We define
$\frac{|est - prec|}{prec} \times 100\%$ as the
relative error. We measure the median relative error in a workload
of 10K queries. Relative error is undefined when $prec$ is 0. If $prec$
in a query is 0, we drop that query.
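
The workload construction and the error measure described above reduce to two small formulas. The sketch below is a hedged illustration with hypothetical function names (`range_length`, `relative_error`, `median_relative_error`); it assumes, as the text does, uniformly distributed data when deriving the per-attribute range length, and it drops queries whose precise answer is 0.

```python
import statistics

def range_length(domain_length, theta, lam):
    """Length of the query range on one attribute so that lam QI predicates
    plus the SA predicate jointly reach expected selectivity theta."""
    return domain_length * theta ** (1.0 / (lam + 1))

def relative_error(est, prec):
    """Relative error in percent; None when prec is 0 (undefined)."""
    if prec == 0:
        return None
    return abs(est - prec) / prec * 100.0

def median_relative_error(results):
    """Median relative error over (est, prec) pairs, skipping prec == 0."""
    errors = [relative_error(est, prec) for est, prec in results if prec != 0]
    return statistics.median(errors)
```

For instance, with $\theta = 0.1$ and $\lambda = 3$, each of the four predicates covers a fraction $0.1^{1/4}$ of its attribute's domain, which is a little over half of it.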

In our first experiment, we use the first 5 attributes in Table 3 as
QI, with expected selectivity $\theta = 0.1$, and vary the dimensionality
of the query, i.e. the number of QI attributes $\lambda$ on which predicates
are defined. As these attributes contribute to the error, an increase
of $\lambda$ tends to raise the error. However, as $\lambda$ grows, the
length of the query range $R_A$ in the domain of each queried
attribute also grows (for constant $\theta$); thereby, the minimum bounding
box of an EC becomes more likely to be entirely contained in the
query region. In effect, the error does not depend monotonically
on $\lambda$ (Figure 8(a)); it does not matter much how many attributes a
given selectivity $\theta$ is shared among. In the next experiment, we fix
$\lambda$ to 3, $\theta$ to 0.1, and vary $\beta$. Figure 8(b) shows the results. As $\beta$
grows, the privacy requirement is relaxed, hence information
quality rises and the error drops. Next, we set $\theta$ to 0.1, and vary the
QI size. As the QI size increases, the data tend to be more sparse
in QI-space, hence it is more likely that ECs with bigger bounding
boxes are created. Thus, in Figure 8(c) the workload error increases
with QI size, for all compared methods, while BUREL presents the
most modest increase. Last, Figure 8(d) presents the results as a
function of selectivity $\theta$. As $\theta$ grows, the length of the range $R_A$
for each attribute in a predicate increases. This makes the minimum
bounding box of an EC more likely to be entirely contained in the
query region, so the estimate becomes more accurate and the error
smaller. BUREL achieves consistently better utility.

Figure 8: Median relative error

6.3 Evaluation on Perturbation


In this section we evaluate the performance of our perturbation-based
β-likeness scheme discussed in Section 5. We keep the QI value of
each tuple unchanged, and only randomize its SA value according to a
probability defined in the conditional Equation 12. We emphasize
that, unlike with BUREL, there is no information loss from
generalized QI values to examine; still, we can study the utility of
the perturbed data set, again by aggregation queries. However,
whereas the query answer for generalized data is estimated using the
intersection between the query and the EC, now we estimate the result
simply by reconstructing the original SA distribution from the
perturbed SA values of those tuples that satisfy a query's QI
predicates, as discussed in Section 5.
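
The idea of reconstructing a distribution from perturbed values can
be sketched as follows. This is a deliberately simplified
illustration, not the scheme of Section 5: it assumes a single
retention probability `alpha` for all SA values and uniform
replacement otherwise, whereas our scheme derives its probabilities
from Equation 12. The function names are hypothetical.

```python
import random
from collections import Counter

def perturb(values, domain, alpha, rng):
    """Keep each value with probability alpha; otherwise replace it with a
    value drawn uniformly from the domain (possibly the same value)."""
    return [v if rng.random() < alpha else rng.choice(domain) for v in values]

def reconstruct_counts(perturbed, domain, alpha):
    """Invert the expected effect of perturb().

    E[observed_j] = alpha * n_j + (1 - alpha) * N / m, hence
    n_j is estimated as (observed_j - (1 - alpha) * N / m) / alpha.
    """
    n, m = len(perturbed), len(domain)
    observed = Counter(perturbed)
    noise = (1 - alpha) * n / m
    return {v: (observed[v] - noise) / alpha for v in domain}

# Toy check on counts that equal their expectation: 80 'a' and 20 'b'
# perturbed with alpha = 0.5 yield, in expectation, 65 'a' and 35 'b'.
estimate = reconstruct_counts(["a"] * 65 + ["b"] * 35, ["a", "b"], 0.5)
```

On real perturbed data the observed counts deviate from their
expectation, so the estimate is noisy; the more tuples satisfy a
query's QI predicates, the smaller the relative deviation tends to be.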

Since our β-likeness scheme by perturbation is built on
(ρ1i, ρ2i)-privacy, for the sake of convenience we refer to it as
(ρ1i, ρ2i)-privacy. We emphasize that, on the one hand, BUREL is
based on generalization, with the desirable property of identity
anonymity; on the other hand, (ρ1i, ρ2i)-privacy randomizes the SA
value of each tuple independently, and is thus immune to corruption
attacks, in which one may infer the SA value of a victim on condition
that they already know the SA values of some individuals [30].
However, BUREL and (ρ1i, ρ2i)-privacy are mutually incomparable.
Besides, there is no previous work that achieves a privacy guarantee
comparable to β-likeness by perturbation; the most recent related
work that offers a privacy guarantee by perturbation, [5], is also
built on (ρ1i, ρ2i)-privacy, yet only limits the posterior
probability of inferring any individual SA value, a privacy guarantee
comparable to ℓ-diversity. In the absence of another competitor, we
introduce and compare to a Baseline approach, which publishes the
exact QI value of each tuple together with the overall SA
distribution in the original table, in the way of Anatomy [33].
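
One plausible way to answer an aggregation query from such a Baseline
release is sketched below. The estimator (number of tuples matching
the QI predicates, scaled by the overall SA mass that satisfies the
SA predicate) is our assumption for illustration, and the function
name, predicates, and toy data are hypothetical.

```python
def baseline_estimate(qi_rows, sa_distribution, qi_pred, sa_pred):
    """Estimate COUNT(*) WHERE qi_pred AND sa_pred from a Baseline release.

    qi_rows: exact QI tuples (published unchanged).
    sa_distribution: dict mapping each SA value to its overall frequency.
    """
    matching = sum(1 for row in qi_rows if qi_pred(row))
    sa_mass = sum(p for v, p in sa_distribution.items() if sa_pred(v))
    return matching * sa_mass

# Toy release: one QI attribute (age) and three SA values.
rows = [(25,), (35,), (45,), (55,)]
dist = {"flu": 0.5, "hiv": 0.25, "cold": 0.25}
est = baseline_estimate(
    rows, dist,
    qi_pred=lambda r: 30 <= r[0] <= 50,
    sa_pred=lambda v: v in {"flu", "hiv"},
)
```

Because the overall SA distribution carries no information about how
SA values correlate with QI values, such an estimate is exact only
when the two are independent within the queried region.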

Figure 9 shows our results. We first set the QI size to 5 and the
query selectivity θ = 0.1, and vary the number λ of QI attributes in
the aggregation queries. The QI value of each tuple remains intact
for both (ρ1i, ρ2i)-privacy and Baseline. Thus, only the predicate on
SA, pred(SA), incurs an error. As λ grows, the query range interval
R_SA for SA also increases; in effect, more tuples satisfy the query,
and the reconstructed SA distribution is closer to the actual one.
Therefore, as Figure 9(a) shows, the workload error decreases as a
function of λ. Next, we set λ = 3, θ = 0.1, and study the effect of
β. Baseline is independent of β; the small fluctuation of its curve
is due to the fact that we randomly generate R_A (i.e., the query
range interval for an attribute) in each experiment. However, f(p_i),
the allowed posterior confidence of an attacker on SA value v_i,
grows as a function of β. A higher value of f(p_i) implies a larger
α_i, allowing for a higher probability that an SA value remains
intact after randomization. Therefore, the data utility rises as β
grows (Figure 9(b)). Next, we set β = 4, and vary the QI size;
Figure 9(c) shows the results. As neither (ρ1i, ρ2i)-privacy nor
Baseline modifies any QI value, the utility of perturbed data depends
on the input data set. Therefore, the workload error does not change
uniformly with QI size. Last, we study the effect of varying θ. When
θ is larger, R_SA also grows. Hence, more tuples satisfy the query,
and the result becomes more accurate, as Figure 9(d) shows.
Remarkably, in all presented cases, our perturbation-based scheme is
consistently more accurate than the Baseline approach.

Figure 9: Median relative error with perturbation

7. RESISTANCE TO ATTACKS 

We now discuss the resistance of our model and schemes to sev- 
eral types of attack proposed in the literature. 

A minimality attack [31] applies when an algorithm populates ECs with
tuples while explicitly heeding privacy considerations, making
decisions "uniquely decided by the sensitive value of a particular
tuple" [34]. BUREL first decides on EC dimensions, considering SA
values alone. Then, it decides on the particular contents of each EC,
independently of others, looking only at tuples' QI values and
heeding utility considerations; it does not decide whether to put a
given tuple in one EC or another by looking at its SA value. This
separation of tasks renders BUREL immune to minimality attacks.
Furthermore, [7] has shown that the minimality attack can be easily
averted even in the case of algorithms vulnerable to it.

A deFinetti attack [15] aims to learn the correlation between SA
values and QI values by building a Bayesian network; it starts by
assuming a random permutation to assign each SA value to a QI value
in each EC, and builds a Naive Bayes classifier out of all such
assignments. Then it evaluates the permutation assigned to each EC,
and generates an improved one, which is in turn used to update the
classifier. This iterative procedure goes on until it converges. In
other words, the classifier exploits divergences between the global
information, as it appears in the whole published table, and local
knowledge within each EC, to iteratively correct the SA to QI
assignments within each EC. We deduce that, if this divergence is
controlled, the success potential of the deFinetti attack can be
correspondingly constrained. The β-likeness principle delimits
exactly this divergence by a threshold β, hence constrains how much
an attacker learns beyond the overall distribution in a published
table. We thus argue that β-likeness curbs the deFinetti attack as
the value of β prescribes. Intuitively, a lower β value allows for
smaller divergences and hence a lower success rate of the attack. We
have defined β-likeness in a way that constrains positive, but not
negative, information gain, as this is the cardinal need in most
practical circumstances. Still, a deFinetti attack may also exploit
negative divergences in order to construct its classifier. In case
such concerns arise, our model can be straightforwardly extended to
constrain negative divergences as well, and thereby further enhance
its capacity to thwart such attacks.

Cormode [6] recently conducted an experimental study of the deFinetti
attack on Anatomy [33], an instantiation of ℓ-diversity, concluding
that the attack is effective for small values of ℓ (2, 3, 4). Still,
as ℓ rises, the attack's success rate deteriorates. In particular,
for ℓ = 5 the rate is below 50%, and when ℓ reaches 7 it falls below
30%. As the attack has so far only been implemented against Anatomy,
presenting the privacy of data anonymized by BUREL in terms of
ℓ-diversity is relevant in this context. The table below presents the
t and ℓ values achieved in terms of t-closeness and ℓ-diversity,
respectively, for the data sets published by β-likeness in the
experiment of Figure 4(a), with β set to 1, 2, 3, 4, and 5; Avg ℓ (t)
stands for the average diversity (closeness) over all the ECs.
Notably, for reasonable values of β, ℓ assumes values no less than 6,
for which the deFinetti attack's success rate is low.

    β    t      Avg t   ℓ      Avg ℓ
    1    …      …       …      …
    2    …      …       …      …
    3    0.13   0.04    8.7    14.2
    4    0.16   0.04    7.2    13.6
    5    0.17   0.05    6.6    12.6
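
The per-EC constraint that the argument above relies on can be
checked with a few lines of code. The sketch below assumes the bound
f(p_i) = (1 + min{β, −ln p_i})·p_i on posterior confidence, applied
only to SA values whose frequency in the EC exceeds their overall
frequency (positive gain); the function names and toy distributions
are hypothetical.

```python
import math

def f(p, beta):
    """Upper bound on posterior confidence for an SA value with prior p > 0."""
    return (1 + min(beta, -math.log(p))) * p

def satisfies_beta_likeness(prior, ec_dist, beta):
    """Check one EC: every SA value whose in-EC frequency q exceeds its
    overall frequency p must satisfy q <= f(p, beta)."""
    for value, q in ec_dist.items():
        p = prior[value]
        # Small tolerance guards against floating-point rounding.
        if q > p and q > f(p, beta) + 1e-12:
            return False
    return True

prior = {"a": 0.5, "b": 0.3, "c": 0.2}
ok_ec = {"a": 0.4, "b": 0.2, "c": 0.4}    # gain on 'c' is exactly beta = 1
bad_ec = {"a": 0.3, "b": 0.2, "c": 0.5}   # 'c' rises from 0.2 to 0.5
```

Negative divergences (values that become less frequent in an EC) pass
this check unconditionally, which mirrors the choice to constrain
positive information gain only.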

The hitherto discussed attacks are designed against
generalization-based schemes. Our perturbation scheme is not
vulnerable to them, as it involves no generalization. Moreover, as it
randomizes each SA value independently, it is immune to corruption
attacks [30], in which an attacker who is already aware of the SA
values of some individuals tries to infer that of a victim. Besides,
our schemes assume the anonymized data are published only once, so as
to prevent composition attacks [11]. Thwarting such attacks with
republication under β-likeness is a problem orthogonal to our work.

Cormode [6] also suggests an attack on differential privacy based on
a Naive Bayes classifier. Such a classifier predicts the SA value of
a tuple t with m QI-attribute values t_j, 1 ≤ j ≤ m, as:

$$v(t) = \arg\max_{v_i \in V} \Pr[v_i] \prod_{j=1}^{m} \Pr[t_j \mid v_i] \qquad (15)$$

The gist of the attack lies in the fact that the conditional
probabilities Pr[t_j | v_i] can be accurately learned based on noisy
count query results extracted from differentially private data. While
the noise in question conceals the contribution of any individual,
its effect on the derived Pr[t_j | v_i] is relatively small [6];
thus, the built classifier works almost as effectively as in the
noiseless case, exploiting variations of Pr[t_j | v_i] values from
their unconditional counterpart, Pr[t_j], to produce a non-trivial
prediction of v(t). On the other hand, β-likeness is defined in a way
that explicitly bounds the variation of these conditional
probabilities from their unconditional counterpart. Specifically, by
Bayes' rule, we get:

$$\Pr[t_j \mid v_i] = \frac{\Pr[v_i \mid t_j] \cdot \Pr[t_j]}{\Pr[v_i]} \qquad (16)$$

For a given sensitive value v_i ∈ V, Pr[v_i] is the prior confidence
in v_i based on the global distribution of SA values, which we have
hitherto denoted as p_i, while Pr[v_i | t_j] is the posterior
confidence that β-likeness bounds by
f(p_i) = (1 + min{β, −ln p_i})·p_i. Then β-likeness guarantees that
Pr[t_j | v_i] ≤ (1 + min{β, −ln p_i})·Pr[t_j]. Thus, β-likeness
bounds the conditional probabilities that the Naive Bayes attack
exploits, delimiting the extent to which their values vary from
Pr[t_j]. Consequently, β-likeness delimits the potential for a Naive
Bayes attack to succeed, causing Equation (15) to predict the most
frequent SA value in the table most of the time.
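
For completeness, the step from Bayes' rule to the stated bound can
be spelled out as follows, assuming that p_i = Pr[v_i] > 0 and that
the posterior confidence Pr[v_i | t_j] obeys the β-likeness limit
f(p_i):

```latex
\begin{align*}
\Pr[t_j \mid v_i]
  &= \frac{\Pr[v_i \mid t_j]\,\Pr[t_j]}{\Pr[v_i]}
   \le \frac{f(p_i)\,\Pr[t_j]}{p_i} \\
  &= \frac{(1 + \min\{\beta, -\ln p_i\})\,p_i\,\Pr[t_j]}{p_i}
   = (1 + \min\{\beta, -\ln p_i\})\,\Pr[t_j].
\end{align*}
```

The factor p_i cancels, so the relative deviation of Pr[t_j | v_i]
from Pr[t_j] is capped by the same quantity min{β, −ln p_i} that caps
the relative confidence gain on v_i.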

The preceding analysis has been made without prejudice to the
publication format, and hence applies to any scheme satisfying
β-likeness. However, the same analysis can be made specifically for
publication by generalization. Assume BUREL outputs e ECs, and f of
those include QI attribute value t_j. Let {G_1, ..., G_f} be the set
of ECs that contain t_j, {G_{f+1}, ..., G_e} the set of all other
ECs, and q_i^k the frequency of SA value v_i in EC G_k. Then it is:

The last inequality confirms our previous result. For illustration,
we estimate Pr[t_j | v_i] values as in Equation (17) on the
anonymized CENSUS data, using the first three attributes as QI, to
predict the SA value of each tuple by Equation (15), for
β ∈ {1, 2, 3, 4, 5}.

We obtain the success rate shown in the figure above. As expected,
this success rate remains remarkably close to the frequency of the
most frequent SA value in the data, namely 4.8402%.
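
The prediction rule of Equation (15) can be sketched in a few lines.
This is a generic Naive Bayes illustration under our own assumptions:
it learns the probabilities from plain (QI, SA) pairs by counting,
without smoothing, rather than from anonymized data via Equation
(17); the function names and toy rows are hypothetical.

```python
from collections import Counter, defaultdict

def train(rows):
    """rows: list of (qi_values, sa_value) pairs.
    Returns the counts needed to evaluate Equation (15)."""
    sa_counts = Counter(sa for _, sa in rows)
    cond = defaultdict(Counter)  # cond[(j, sa)][t_j] = co-occurrence count
    for qi, sa in rows:
        for j, t_j in enumerate(qi):
            cond[(j, sa)][t_j] += 1
    return sa_counts, cond

def predict(qi, sa_counts, cond):
    """Return the v_i maximizing Pr[v_i] * prod_j Pr[t_j | v_i]."""
    total = sum(sa_counts.values())
    best, best_score = None, -1.0
    for sa, n_sa in sa_counts.items():
        score = n_sa / total                      # Pr[v_i]
        for j, t_j in enumerate(qi):
            score *= cond[(j, sa)][t_j] / n_sa    # Pr[t_j | v_i]
        if score > best_score:
            best, best_score = sa, score
    return best

# QI value and SA value are independent here, so Pr[t_j | v_i] = Pr[t_j]
# and the classifier can only fall back on the most frequent SA value.
independent = [(("x",), "a"), (("y",), "a"), (("x",), "a"), (("y",), "a"),
               (("x",), "b"), (("y",), "b")]
counts, cond = train(independent)
guess_x, guess_y = predict(("x",), counts, cond), predict(("y",), counts, cond)
```

In the toy data the classifier returns the majority SA value for
every QI value, which is the behavior the bound on Pr[t_j | v_i]
pushes the attack towards.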

Last, we emphasize that β-likeness is a privacy model for categorical
data. Its extension to numerical data is an interesting topic for
future research. Such an extension should constrain not merely the
variation in the frequencies of discrete numerical values, but rather
of any values in close proximity to each other. Doing so, it would be
immune to proximity attacks [19], as they apply to numerical data. In
case proximity is defined for categorical data by a semantic
hierarchy of categorical values, our model can be easily extended so
as to treat all values beneath the same selected nodes in this
hierarchy as the same, and ensure β-likeness for such groups of
values instead of leaf nodes in the hierarchy. We also emphasize that
our model is built under the assumption that an attacker has no other
prior knowledge apart from the overall distribution of sensitive
values. Rastogi et al. [26] show that, if an adversary knows
arbitrary correlations among tuples, there exists no useful
anonymization algorithm that can achieve both privacy and utility.



8. CONCLUSION 

In this paper we revisited the microdata anonymization problem with
three distinct contributions. First, we introduced β-likeness, a
robust privacy model that provides a comprehensible and intuitively
appealing privacy guarantee, expressed as a limit on the relative
confidence gain on each single sensitive attribute value. Second, we
devised BUREL, a novel generalization algorithm explicitly customized
for this model. Third, we devised a perturbation technique for our
model. Our experimental results confirm that algorithms developed for
other privacy models cannot achieve strong guarantees in terms of
β-likeness, and verify the effectiveness and efficiency of both our
schemes in their task. Apart from this experimental study, we also
provided arguments and results to the effect that the β-likeness
privacy guarantee affords genuine protection against attacks
suggested in previous research. In the future, we intend to extend
our model to numerical sensitive attributes.



We thank Daniel Kifer and Graham Cormode for lucid remarks on 
this topic, and the anonymous reviewers for their apt feedback. 